
Research Article
Precomputed Clustering for Movie Recommendation System in Real Time

Bo Li,1 Yibin Liao,1 and Zheng Qin2

1 Department of Computer Science, University of Georgia, Athens, GA 30602, USA
2 School of Software, Tsinghua University, Beijing 100084, China

Correspondence should be addressed to Zheng Qin; qingzh@tsinghua.edu.cn

Received 7 February 2014; Accepted 12 May 2014; Published 25 June 2014

Academic Editor: Guiming Luo

Copyright © 2014 Bo Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

A recommendation system delivers customized data (articles, news, images, music, movies, etc.) to its users. As interest in recommendation systems grows, we started working on movie recommendation systems. Most research efforts in the field of movie recommendation focus on discovering the most relevant features from users, on seeking out users who share the same tastes as the given user and recommending movies according to the likings of those users, or on seeking out users who share a connection with other people (friends, classmates, colleagues, etc.) and making recommendations based on those related people's tastes. However, little research has focused on recommending movies based on the movies' features. In this paper, we present a novel idea that applies machine learning techniques to construct clusters of movies by computing a distance matrix based on the movie features and then makes movie recommendations in real time. We implement several clustering methods and evaluate their performance on a real movie forum website owned by one of our authors. This idea can also be used in other types of recommendation systems, such as music, news, and articles.

1. Introduction

Movie recommendation systems suggest movies that a user might be interested in. The suggestions are generated by considering many aspects and can be based on tastes, interests, goals, social connections, and so forth.

In general, a movie recommendation system compares the user's profile or usage data to some reference characteristics and combines this with the user's social environment to make movie recommendations. This type of recommendation is user based. However, it may fail or produce inaccurate recommendations in the following situations. The user may not have a strong profile in the system: many users do not want to fill in their profile due to laziness or privacy concerns. In this case, most recommendation systems consider the user's social connections in the system, such as friends, classmates, family, and colleagues. However, tastes in movies may vary even among best friends. What is worse, the "friends" that the user has added on the website may not be people with the same interests. For example, in a community network or local network such as a university or college, like our experimental environment, users' social connections are built mainly because they are in the same university with the same major and a similar age; however, their tastes in movies may be totally different, which violates the fundamental assumption of the recommendation approach described above.

Because of the situation described above, we propose using machine learning (ML) techniques (clustering) to analyze the movie features and system logs (users' voting logs) in order to make more accurate recommendations. In the proposed approach, we compute a distance matrix for the movie features and apply clustering techniques to group movies into different clusters offline. For every user logged on to the system, we recommend movies from the clusters, combined with the user's majority voting result, in real time.

In order to compare accuracy and efficiency, we have implemented the following clustering techniques: DBSCAN (density-based clustering), affinity propagation, hierarchical clustering, and random clustering as a baseline.

Hindawi Publishing Corporation, Journal of Applied Mathematics, Volume 2014, Article ID 742341, 9 pages, http://dx.doi.org/10.1155/2014/742341

Figure 1: System overview (distance computation system, preclustering system, and real-time online recommendation system).

We have tested this proposal with all the clustering techniques on a university social website owned by one of our authors. The website has a subsystem that provides a movie service. We added a recommendation function in a banner and put it on top of that section to let users vote. We offer users the clustering results from all four clustering algorithms shuffled together and record the voting results separately.

The rest of the paper is organized as follows. Section 2 gives an overview of related works in which machine learning techniques have been applied to recommendation systems. Section 3 gives the system overview of our recommendation system. Section 4 describes the distance functions we use to compute the distance matrix. Section 5 introduces the clustering techniques used in our preclustering system. Section 6 presents the experimental setup and evaluation results. Finally, Section 7 concludes with discussions and future research.

2. Related Work

Machine learning techniques are very useful when huge amounts of data have to be classified and analyzed, which nowadays is a common situation in many scenarios, especially in recommendation systems. Many different machine learning techniques are used in recommendation systems, such as Naive Bayes classification [1], decision trees [2], k-means clustering and its improvements [3-6], and so forth.

Generally, recommendation systems are divided into two major categories: collaborative recommendation systems and content based recommendation systems [7]. Collaborative recommendation systems try to seek out users who share the same tastes as the given user and recommend movies according to the likings of those users. For example, Tatli and Birturk described an approach for creating music recommendations based on user-supplied tags that are augmented with a hierarchical structure extracted for top-level genres from DBpedia. In this structure, each genre is represented by its stylistic origins, typical instruments, derivative forms, subgenres, and fusion genres. They use this well-organized structure for dimensionality reduction in user profiling [8]. Yang et al. proposed a personalized web page recommendation model called PIGEON (abbreviation for PersonalIzed web paGe rEcommendatiON) via collaborative filtering and a topic-aware Markov model, and proposed a graph-based iteration algorithm to discover the topics a user is interested in, based on which user similarities are measured [9]. Cai et al. also proposed a model that fully captures the bilateral role of user interactions within a social network and formulated collaborative filtering methods to enable people-to-people recommendation [10]. As mentioned in Section 1, making recommendations based on users does not work properly in some particular settings, such as a community network system or a university network.

Content based recommendation systems try to recommend content similar to what the user has liked. Biancalana et al. proposed two different context-aware approaches for the movie recommendation task: one is a hybrid recommender that assesses available contextual factors related to time in order to increase the performance of traditional collaborative filtering (CF) approaches, and the other aims at identifying the users in a household who submitted a given rating [11]. The latter approach is based on machine learning techniques, namely, neural networks and majority voting. In our paper, we focus on content based recommendation and use majority voting and clustering techniques. We obtained better results compared to their research, as shown in Section 6.

Some researchers have proposed hybrid recommendation approaches that combine different techniques. Ghazanfar and Prügel-Bennett proposed a switching hybrid recommendation approach that combines a Naive Bayes classification approach with collaborative filtering [1]. Ujwala et al. presented and investigated an approach based on a weighted association rule mining algorithm and text mining [7]. Bellogín et al. combined decision trees and attribute selection to build a recommendation model that finds the most relevant preferences of the user and the system [2]. The idea of hybrid recommendation inspires us to combine different approaches for movie recommendation in future work.

3. System Overview

Figure 1 provides a high-level overview of our new movie recommendation approach. As shown in Figure 1, the new movie recommendation system has three parts: the distance computation system, the preclustering system, and the real-time online recommendation system.

3.1. Distance Computation System. The distance computation system computes the distance between movies based on different properties of the movies, including movie types, publish year, countries of the publishing companies, languages, directors, casts, and duration. We first compute the Jaccard distance based on movie types, countries of the publishing companies, languages, directors, and casts, and then the distances based on publish year and duration using the distance functions defined in Sections 4.2.2 and 4.2.3. We then compute the overall distance between each pair of movies by summing these distances with the weights obtained from the survey described in Section 4.1.

Figure 2: A demo of voting. The probability that recommended movies are in $C_1$ is 5/8.

3.2. Preclustering System. The preclustering system separates movies into different clusters before passing them to the online recommendation system. In this system, we tried five clustering algorithms: affinity propagation, DBSCAN, hierarchical clustering, spectral clustering, and a random generator. Of these, spectral clustering was too slow to produce results in the required time. Therefore, we sent the remaining four clustering results to the recommendation system in two data formats: one returns the cluster label for a given movie's IMDb identifier, and the other returns all the movies' IMDb identifiers for a given cluster label. In this way, the online recommendation system in Section 3.3 can quickly get the information it needs.

3.3. Real-Time Online Recommendation System. After getting the preclustering results, the website reads the history of users' votes on movies and uses the majority of the votes to generate recommendations. In more detail, if a user $U_i$ likes movie $M_{j_1}$, which belongs to cluster $C_{k_1}$, $U_i$ gives one vote to $C_{k_1}$. In contrast, if $U_i$ dislikes movie $M_{j_2}$ in $C_{k_2}$, $U_i$ cancels one vote for $C_{k_2}$. If a cluster ends up with a negative vote total, the total is reset to 0. After $U_i$ has voted on all the movies he/she comments on, different clusters have different vote totals. The movie recommendation system recommends movies according to the votes $V(C_j)$ of the clusters $C_j$ as follows:

$$P(C_i) = \frac{V(C_i)}{\sum_{j=1}^{n} V(C_j)} \tag{1}$$

where $P(C_i)$ is the probability that a movie should be recommended from cluster $C_i$, $V(C_i)$ is the number of votes in cluster $C_i$, and $n$ is the number of clusters, as shown in Figure 2.

To make sure that, in our evaluation, a user does not vote dislike on a recommended movie simply because the movie itself is bad, we recommend only movies with an IMDb score higher than 7 from the clusters that the user's votes have selected.
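The vote counting and formula (1) can be sketched as follows. This is a minimal illustration, not the website's actual code: the function names, the movie and cluster identifiers, and the reading that negative totals are reset to 0 once after all votes are tallied are assumptions. The sample votes are hypothetical and chosen so that $P(C_1) = 5/8$, as in the Figure 2 caption.

```python
from collections import defaultdict

def cluster_votes(votes, movie_to_cluster):
    """Tally votes per cluster; votes are (movie_id, score) with score +1, -1, or 0."""
    totals = defaultdict(int)
    for movie_id, score in votes:
        totals[movie_to_cluster[movie_id]] += score
    # Assumption: a cluster that ends with a negative total is reset to 0.
    return {cluster: max(total, 0) for cluster, total in totals.items()}

def cluster_probabilities(totals):
    """Formula (1): P(C_i) = V(C_i) / sum_j V(C_j)."""
    vote_sum = sum(totals.values())
    if vote_sum == 0:
        return {cluster: 0.0 for cluster in totals}
    return {cluster: total / vote_sum for cluster, total in totals.items()}

# Hypothetical data: 5 likes in C1, 3 likes in C2, 1 dislike in C3.
movie_to_cluster = {f"a{i}": "C1" for i in range(5)}
movie_to_cluster.update({f"b{i}": "C2" for i in range(3)})
movie_to_cluster["c0"] = "C3"
votes = [(m, 1) for m in movie_to_cluster if m != "c0"] + [("c0", -1)]

probabilities = cluster_probabilities(cluster_votes(votes, movie_to_cluster))
print(probabilities)
```

Because the per-cluster totals are plain counters, they can be updated on every new vote without touching the precomputed clusters.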

4. Distance Computation System

4.1. Weight Survey. Movies' properties form a special space in which the dimensions carry very different weights. For example, the type of a movie is of course more important than its duration. Therefore, we cannot use the default distance functions provided by the Scikit-learn [18] package that we use. Instead, we need either feature selection, as in [2], or a survey to determine the weights of the different dimensions. In research areas where humans do not know which features are important, such as carcinogenicity, feature selection is always used to decide the weights of different features. In the movie recommendation area, however, we can investigate the weights with the survey shown in Figure 3, since people know why they like a movie. By carefully designing a survey and letting users vote for their top three factors, we obtained the result shown in Figure 4.

From the survey result shown in Figure 4, we derive the following weights: publish year 0.0915, country 0.147, type 0.4167, language 0.0545, director 0.0896, casts 0.1792, and duration 0.0214. These weights are used in the overall distance computation in Section 4.3.

4.2. Distance Function

4.2.1. Jaccard Distance. For the countries, languages, casts, and types of a movie, the distance is determined by how many items two sets share and how many differ, so we decided to use the Jaccard distance proposed in [12]. The Jaccard distance is a statistic for measuring the dissimilarity between sample sets. It is given in formula (2). According to this formula, we use intersection and union to obtain the distance for each of the movie features mentioned above:

$$J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} \tag{2}$$
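Formula (2) translates directly into set operations. The following is a minimal sketch; the function name, the example genre sets, and the convention that two empty sets have distance 0 are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """Formula (2): J(A, B) = 1 - |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty feature sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical type sets for two movies: one shared item out of three.
type_distance = jaccard_distance({"Drama", "Crime"}, {"Drama", "Thriller"})
print(type_distance)
```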

4.2.2. Publish Year Distance. The distance between the publish years of movies is a special case for human perception. Intuitively, the distance between movies published in the 1950s and the 1960s seems much smaller than that between movies published in 2012 and 2013. In other words, the publish year distance is not a linear function. One reason for this phenomenon is that more and more movies have been published in recent years. The number of movies per year in our database is shown in Figure 5.

Figure 5 shows that the number of movies published each year follows an approximately exponential function. The trend line is as follows:

$$f(x) = 0.3584 \cdot e^{0.0598(x - 1902)} \tag{3}$$

Figure 3: Survey in our community.

Figure 4: Survey of top three liked movie factors (publishing year 9%, country 15%, type 42%, language 5%, director 9%, actor 18%, duration 2%).

According to formula (3), we propose a new distance function for publish year, $\mathrm{PD}(\mathrm{year}_i, \mathrm{year}_j)$, shown in the following:

$$\mathrm{PD}(\mathrm{year}_i, \mathrm{year}_j) = \frac{\int_{\min(\mathrm{year}_i, \mathrm{year}_j)}^{\max(\mathrm{year}_i, \mathrm{year}_j)} f(x)\,dx}{\int_{1902}^{2014} f(x)\,dx} \tag{4}$$
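Since $f$ in formula (3) is an exponential, both integrals in formula (4) have a closed form, $\int_a^b f(x)\,dx = \frac{0.3584}{0.0598}\left(e^{0.0598(b-1902)} - e^{0.0598(a-1902)}\right)$. The sketch below uses that closed form; the constant and function names are assumptions for illustration.

```python
import math

# Coefficients of the trend line in formula (3).
A, B, X0 = 0.3584, 0.0598, 1902

def trend_integral(lo, hi):
    """Closed-form integral of f(x) = A * exp(B * (x - X0)) from lo to hi."""
    return (A / B) * (math.exp(B * (hi - X0)) - math.exp(B * (lo - X0)))

def publish_year_distance(year_i, year_j, first=1902, last=2014):
    """Formula (4): integral between the two years, normalized by the full range."""
    lo, hi = min(year_i, year_j), max(year_i, year_j)
    return trend_integral(lo, hi) / trend_integral(first, last)

# A decade in the mid-century is closer than one recent year under this distance.
print(publish_year_distance(1950, 1960), publish_year_distance(2012, 2013))
```

The comparison printed at the end matches the intuition stated in Section 4.2.2: the 1950 to 1960 distance comes out smaller than the 2012 to 2013 distance.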

Figure 5: Statistic result of movies published in each year (x-axis: year, 1880 to 2020; y-axis: count, 0 to 900; trend line $y = 0.3584 e^{0.0598(x - 1902)}$).

4.2.3. Duration Distance. Humans are very sensitive to small differences in time: half an hour feels nearly twice as long as a quarter of an hour. Therefore, we use the linear function shown in formula (5) to compute the duration distance $\mathrm{DUD}(\mathrm{dur}_i, \mathrm{dur}_j)$:

$$\mathrm{DUD}(\mathrm{dur}_i, \mathrm{dur}_j) = \frac{\min\left(\left|\mathrm{dur}_i - \mathrm{dur}_j\right|, 200\right)}{200} \tag{5}$$

Here, in order to normalize the distance, we consider 200 minutes to be the maximum duration of a movie.
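Formula (5) is a capped, normalized absolute difference. A minimal sketch, with durations in minutes and a hypothetical function name:

```python
def duration_distance(dur_i, dur_j, cap=200):
    """Formula (5): absolute difference in minutes, capped at `cap` and scaled to [0, 1]."""
    return min(abs(dur_i - dur_j), cap) / cap

print(duration_distance(90, 120))   # a 30-minute difference
print(duration_distance(30, 400))   # differences above 200 minutes saturate
```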

Figure 6: Matrix computation parallel demo (matrix areas labeled (1) to (5)).

4.3. Overall Distance. We build an $n \times n$ distance matrix between the movies in our database based on formula (6) as follows:

$$\mathrm{OD}(m_i, m_j) = \sum_{k \in \text{movie features}} \mathrm{weight}_k \cdot \mathrm{distance}_k(i, j) \tag{6}$$
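Formula (6) is a weighted sum over the per-feature distances. The sketch below assumes the per-feature distances for one movie pair are already available in a dictionary; the key names and the sample distances are hypothetical, while the weights are those reported in Section 4.1.

```python
# Weights from the survey in Section 4.1 (they sum to approximately 1).
WEIGHTS = {
    "year": 0.0915, "country": 0.147, "type": 0.4167, "language": 0.0545,
    "director": 0.0896, "casts": 0.1792, "duration": 0.0214,
}

def overall_distance(feature_distances, weights=WEIGHTS):
    """Formula (6): sum over features k of weight_k * distance_k(i, j)."""
    return sum(weights[k] * feature_distances[k] for k in weights)

# Hypothetical per-feature distances for one pair of movies, each in [0, 1].
pair = {"year": 0.1, "country": 0.0, "type": 0.5, "language": 0.0,
        "director": 1.0, "casts": 0.8, "duration": 0.15}
print(overall_distance(pair))
```

Since every per-feature distance lies in [0, 1] and the weights sum to roughly 1, the overall distance also stays in [0, 1].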

4.4. Parallel Distance Computation. Since our distance matrix is symmetric, we can use this property to compute the distances in parallel.

To do parallel computation, two conditions must be satisfied:

(1) The work assigned to the different processes must be approximately equal.

(2) There is no interaction or data exchange between the different working processes.

The distance matrix we need to build is a symmetric matrix in which all the elements on the main diagonal are equal to 0, formally defined as a matrix $D$ where $D = D^T$ and $D[i, i] = 0$, as follows:

$$D = \begin{bmatrix}
0 & \lambda_{0,1} & \lambda_{0,2} & \cdots & \lambda_{0,n-1} & \lambda_{0,n} \\
\lambda_{0,1} & 0 & \lambda_{1,2} & \cdots & \lambda_{1,n-1} & \lambda_{1,n} \\
\lambda_{0,2} & \lambda_{1,2} & 0 & \cdots & \lambda_{2,n-1} & \lambda_{2,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\lambda_{0,n-1} & \lambda_{1,n-1} & \lambda_{2,n-1} & \cdots & 0 & \lambda_{n-1,n} \\
\lambda_{0,n} & \lambda_{1,n} & \lambda_{2,n} & \cdots & \lambda_{n-1,n} & 0
\end{bmatrix} \tag{7}$$

A sketch of the distance matrix in formula (7) is shown in Figure 6. Since the distance matrix is symmetric, the data in area (1) and area (2), and that in area (4) and area (5), are correspondingly equal. Therefore, instead of computing all of them, we compute only the upper triangle of the matrix, that is, the data in areas (1) and (5). However, if we separate the parallel tasks by rows of the matrix, the amount of work (number of elements) in each task is not equal (it ranges from $n - 1$ down to 1). To overcome this problem, the work needed in area (4) is filled into area (3), so that we can group the $n - 1$ unequal tasks into $\lceil n/2 \rceil$ equal tasks. In this way, a process pool in which each process computes the distances of one task can be used to run these $\lceil n/2 \rceil$ tasks in parallel.

Figure 7: Affinity propagation demo [18].

More specifically, we first set the main diagonal to 0. There is no need to compute the main diagonal, since we know that the distance between a movie and itself is 0. Then, for each row $i$, we combine the distance computations of $D[n-1-i, n-i]$, $D[n-1-i, n-i+1]$, ..., $D[n-1-i, n-2]$, $D[n-1-i, n-1]$ and those of $D[i, i+1]$, $D[i, i+2]$, ..., $D[i, n-2]$, $D[i, n-1]$ together as one task. After the computation, we get an upper triangular matrix, that is, area (1) and area (4). We then fill the lower triangle with the upper triangle, that is, let $D = D + D^T$.
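The row pairing described above can be sketched as follows: row $i$ (with $n - 1 - i$ upper-triangle cells) is grouped with row $n - 1 - i$ (with $i$ cells), so each task holds $n - 1$ cells. This is a minimal illustration under stated assumptions: the function names are hypothetical, a thread pool from the standard library stands in for the process pool mentioned in the text, and the mirror cells are written directly instead of computing $D + D^T$ afterwards.

```python
from concurrent.futures import ThreadPoolExecutor
from math import ceil

def paired_row_tasks(n):
    """Group upper-triangle cells into ceil(n/2) tasks by pairing row i with row n-1-i."""
    tasks = []
    for i in range(ceil(n / 2)):
        cells = [(i, j) for j in range(i + 1, n)]
        mirror = n - 1 - i
        if mirror != i:  # for odd n the middle row has no partner
            cells += [(mirror, j) for j in range(mirror + 1, n)]
        tasks.append(cells)
    return tasks

def distance_matrix(n, dist, workers=4):
    """Fill a symmetric n x n matrix; dist(i, j) is any pairwise distance function."""
    D = [[0.0] * n for _ in range(n)]  # main diagonal stays 0
    def run(cells):
        return [(i, j, dist(i, j)) for i, j in cells]
    # Assumption: a thread pool replaces the process pool for brevity.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(run, paired_row_tasks(n)):
            for i, j, d in result:
                D[i][j] = d
                D[j][i] = d  # mirror into the lower triangle
    return D

task_sizes = [len(t) for t in paired_row_tasks(6)]
D = distance_matrix(5, lambda i, j: float(abs(i - j)))
print(task_sizes)
```

For even $n$ every task has exactly $n - 1$ cells; for odd $n$ the unpaired middle row forms one task of $(n - 1)/2$ cells.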

5. Preclustering System

In this section, we use four clustering methods to determine which one works best in the final movie recommendation system. The four clustering algorithms are affinity propagation [13], DBSCAN [14], hierarchical clustering [15], and random clustering as a baseline.

5.1. Clustering Method

5.1.1. Affinity Propagation. Affinity propagation, which was first proposed in [13], generates clusters by sending messages between pairs of samples until convergence or until the changes fall below a threshold. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability of one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen and hence the final clustering is given, as demonstrated in Figure 7.

5.1.2. DBSCAN. The DBSCAN algorithm [14] views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be of any shape, as opposed to k-means, which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure), and a set of non-core samples that are close to a core sample (but are not themselves core samples), as demonstrated in Figure 8.

Figure 8: DBSCAN demo [18].

5.1.3. Hierarchical Clustering. Hierarchical clustering [15] is a general family of clustering algorithms that build nested clusters by merging them successively. This hierarchy of clusters is represented as a tree. The root of the tree is the unique cluster that gathers all the samples, and the leaves are the clusters with only one sample.

The beauty of this clustering method is that we do not need to specify the number of clusters in advance. Instead, we can start with one cut height and adjust it once the user has voted on most of the movies in one cluster. In this way, we can provide a larger cluster that fits the user's interests, as shown in Figure 9.

5.1.4. Random Clustering. We first fix the number of clusters to 60 and then, for each movie, randomly pick the cluster it belongs to by generating a random number between 1 and 60.
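The random baseline can be sketched in a few lines. The function name and the optional seed (added only to make runs repeatable) are assumptions; the movie identifiers are taken from the voting tables purely as sample input.

```python
import random

def random_clustering(movie_ids, n_clusters=60, seed=None):
    """Assign each movie a uniformly random cluster label in 1..n_clusters."""
    rng = random.Random(seed)
    return {movie_id: rng.randint(1, n_clusters) for movie_id in movie_ids}

labels = random_clustering(["tt0111161", "tt0068767", "tt0109958"], seed=0)
print(labels)
```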

5.2. Preclustering System Output. For efficiency, we output the clustering labels in two formats. One is used to look up the cluster label by movie ID, and the other is used to look up all the movie IDs in a cluster given the cluster label. By writing the files in JSON, we can easily transfer data between the preclustering system in Python and the real-time online recommendation system in PHP.
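The two lookup formats can be sketched with the standard `json` module. The function name and the sample labels are hypothetical, and the exact file layout used by the authors is not given in the text, so the structure below is only one plausible arrangement.

```python
import json
from collections import defaultdict

def build_lookup_tables(labels):
    """Return (movie -> cluster label, cluster label -> list of movies)."""
    movie_to_cluster = {movie: int(cluster) for movie, cluster in labels.items()}
    cluster_to_movies = defaultdict(list)
    for movie, cluster in movie_to_cluster.items():
        # JSON object keys are strings, so cluster labels are stored as text.
        cluster_to_movies[str(cluster)].append(movie)
    return movie_to_cluster, dict(cluster_to_movies)

# Hypothetical clustering result for three movies.
labels = {"tt0111161": 3, "tt0068767": 3, "tt0109958": 7}
by_movie, by_cluster = build_lookup_tables(labels)
by_movie_json = json.dumps(by_movie, sort_keys=True)
by_cluster_json = json.dumps(by_cluster, sort_keys=True)
print(by_movie_json)
print(by_cluster_json)
```

Keeping both directions precomputed means the online side needs only one key lookup per request, in either direction.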

6. Evaluation

Before the evaluation started, we created a survey and asked users to select the top three movie features they care about, as mentioned in Section 4.1. After one week of data collection, we obtained the result shown in Figure 4.

Table 1: Comparison of results between different clustering methods in the first evaluation method (AF: affinity propagation; DB: DBSCAN; HC: hierarchical clustering; RA: random clustering).

Method   True positive   False positive   Accuracy rate
AF       239             98               70.920%
DB       174             77               69.323%
HC       228             88               72.152%
RA       125             104              54.585%

Based on the result, we calculate the distance matrix and use it as input to the different clustering algorithms. We implemented the four clustering techniques using the Python library Scikit-learn.

6.1. Experimental Setup. The evaluation was conducted on a university social website in China that is owned by one of our authors. We added the recommendation function in a banner and put it on top of the movie service to let users vote. An overview of the voting component of the website is shown in Figure 10.

There are three voting choices for each recommended movie: like, do not like, and unseen. Voting like scores 1, voting do not like scores −1, and voting unseen scores 0. Because of the limited time for data collection, we could not wait for users to watch the recommended movies. Therefore, we added the unseen choice for users to select if they have not seen a recommended movie, and we consider only the like and do not like votes when evaluating our approach.

6.2. Result. After two weeks of voting on the movie recommendations, we collected 168,424 votes in total. Of these, 132,360 votes were collected in the first 12 days and used for the recommendation majority voting described in Section 3.3. The remaining 6,064 votes were collected in the two days after the recommendation system was deployed. Since only the like and do not like votes are considered eligible, we generated the results presented in the following subsections.

6.2.1. First Evaluation Result. In the first evaluation result, listed in Table 1, we calculated the true positives, false positives, and accuracy for each clustering technique. True positive equals the number of like votes, and false positive equals the number of do not like votes. Because most of the recommended movies were voted unseen, a total of 1,133 votes were counted, and 216 users participated in the voting. The accuracy is computed as follows:

$$\text{Accuracy} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{8}$$
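Applying formula (8) to the counts in Table 1 reproduces the reported accuracy rates. The sketch below is only a check of that arithmetic; the variable and function names are assumptions.

```python
def accuracy(true_positive, false_positive):
    """Formula (8): like votes divided by like plus do-not-like votes."""
    return true_positive / (true_positive + false_positive)

# (true positive, false positive) counts from Table 1.
table1 = {"AF": (239, 98), "DB": (174, 77), "HC": (228, 88), "RA": (125, 104)}
rates = {name: round(100 * accuracy(tp, fp), 3) for name, (tp, fp) in table1.items()}
print(rates)
```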

As shown by the results in Table 1, the hierarchical clustering approach is the best. However, the accuracy of every clustering approach is unsatisfactory. Therefore, we extracted the users' voting history from our server database and analyzed the data. We found that some users made fewer than 5 votes and some made between 5 and 10 votes. A few users made more than 20 votes. Among those who made more than 20 votes, we found 4 users whose votes were all do not like. Therefore, almost one-third of the false positive votes were made by only 4 of the 216 users. Table 3 shows an example from one such user. The first column is the movie ID from IMDb, the second column is the vote, and the third column is the voting time. We think this user may dislike our recommendation idea and therefore made all of his votes negative.

Figure 9: Hierarchical clustering demo (dendrogram; vertical axis: height).

Figure 10: Website overview (screenshot of the voting component, where each recommended movie shows its IMDb rating and the choices like, dislike, unseen, and details).

Another small portion of users made more than 20 votes, most of which are negative; only a few votes are positive, and none are zero (unseen). Table 4 shows one such example. This user is probably voting negative instead of unseen for the movies he has not seen, since most users voted unseen more often than like or do not like.

6.2.2. Second Evaluation Result. The two situations described in Section 6.2.1 strongly affect our result in the first evaluation. Therefore, we made a second evaluation by calculating each user's accuracy with formula (8) and then computing the average of all users' accuracies:

$$\text{Accuracy} = \frac{\sum \text{accuracy}}{\text{Total Number of Users}} \tag{9}$$
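The difference between the pooled accuracy of formula (8) and the per-user average of formula (9) can be illustrated with a small sketch. The vote lists are hypothetical, and skipping users who cast no like or do not like votes is an assumption, since the text does not say how such users are handled.

```python
def user_accuracy(votes):
    """Formula (8) for one user; votes are +1 (like), -1 (do not like), 0 (unseen)."""
    likes = sum(1 for v in votes if v == 1)
    dislikes = sum(1 for v in votes if v == -1)
    if likes + dislikes == 0:
        return None  # assumption: users with only unseen votes are skipped
    return likes / (likes + dislikes)

def mean_user_accuracy(votes_by_user):
    """Formula (9): average of the per-user accuracies."""
    scores = [a for a in map(user_accuracy, votes_by_user.values()) if a is not None]
    return sum(scores) / len(scores)

# Hypothetical users: one always-negative voter dominates the pooled count.
votes_by_user = {"u1": [1, 1, 0, 1], "u2": [-1] * 25, "u3": [1, -1]}
pooled = user_accuracy([v for votes in votes_by_user.values() for v in votes])
averaged = mean_user_accuracy(votes_by_user)
print(pooled, averaged)
```

In this toy data the single always-negative user pulls the pooled accuracy down to 4/30, while the per-user average is 0.5, which is the effect the second evaluation is designed to reduce.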

Table 2: Comparison of results between different clustering methods in the second evaluation method.

Method   Accuracy rate
AF       80.95%
DB       84.71%
HC       82.03%
RA       63.51%

For the second evaluation, we waited another 6 hours to collect more data. There are 7,100 votes in total, of which 1,420 are considered, and 242 users participated in the voting. In the second evaluation, we obtained a much better result, as shown in Table 2. DBSCAN (density-based clustering), affinity propagation, and hierarchical clustering all achieved over 80% accuracy, and DBSCAN reached 84.71%.

6.3. Comparison with Related Work. In this section, we compare our result with those of several other movie recommendation approaches and systems. Pomerantz and Dudek [16] used a hierarchical Bayesian approach combined with the item similarity of the content for their movie recommender and achieved 70.8% as the best result. In Biancalana et al.'s paper [11], the best result is 82.4%, obtained by combining three classifiers in a neural network. Basu et al. [17] presented an inductive learning approach to recommendation and achieved an average precision of 83%.

Table 3: Example of all negative votes.

imdb        Vote   Last update
tt1716777   -1     2013-12-06 23:36:55
tt0183649   -1     2013-12-06 23:36:56
tt1728196   -1     2013-12-06 23:36:58
tt0014429   -1     2013-12-06 23:36:59
tt0034248   -1     2013-12-06 23:37:01
tt0017136   -1     2013-12-06 23:37:02
tt0035169   -1     2013-12-06 23:37:03
tt0023940   -1     2013-12-06 23:37:04
tt0035209   -1     2013-12-06 23:37:05
tt0047876   -1     2013-12-06 23:37:23
tt0106535   -1     2013-12-06 23:37:24
tt0121164   -1     2013-12-06 23:37:27
tt0006864   -1     2013-12-06 23:37:28
tt0499141   -1     2013-12-06 23:37:30
tt0049010   -1     2013-12-06 23:37:32
tt1298650   -1     2013-12-06 23:40:39
tt0408236   -1     2013-12-06 23:40:40
tt0162661   -1     2013-12-06 23:40:41
tt0349903   -1     2013-12-06 23:40:43
tt0114369   -1     2013-12-06 23:40:44
tt0096895   -1     2013-12-06 23:40:45
tt0111161   -1     2013-12-06 23:40:49
tt2992146   -1     2013-12-06 23:40:53
tt0082971   -1     2013-12-06 23:40:56
tt1075419   -1     2013-12-06 23:40:58

Our best result (84.71%), achieved by DBSCAN [14] as shown in Table 2, is 1.71 percentage points higher than the best result of the related works mentioned above. Moreover, all these related works compute their models offline and use the stored models when testing; the models cannot be updated in time because redoing the learning process takes too long. Our approach, on the other hand, does not need to store user models, and it updates the recommendation result in real time by preclustering movies and using a very simple learning approach (majority voting) to provide movie recommendations. Since movies are updated much less frequently than users' votes, our approach is much more time efficient than the related works. To sum up, our approach achieves real time model updating without compromising accuracy.

7. Conclusion and Future Work

In this paper, we proposed a new distance function for publish year (year related) to compute the distance matrix and then implemented a real time movie recommendation system based on content (movie) using preclustering and majority voting. We implemented four different clustering techniques in the preclustering process and obtained 84.71% accuracy as the best result, which is better than the approaches of some other research papers. We realized a real time updating model with no compromise on accuracy. Our approach can work even better with special user groups such as people in colleges, universities, or professional communities.

Table 4: Example of nonzero votes.

imdb        Vote   Last update
tt0068767   -1     2013-12-06 22:33:30
tt0109958   -1     2013-12-06 22:33:32
tt0047034   -1     2013-12-06 22:33:33
tt0397892   -1     2013-12-06 22:33:35
tt0032551   -1     2013-12-06 22:33:36
tt0110008    1     2013-12-06 22:33:38
tt0416881   -1     2013-12-06 22:33:40
tt1233499   -1     2013-12-06 22:33:42
tt1754691   -1     2013-12-06 22:33:45
tt1034303   -1     2013-12-06 22:33:47
tt0015324   -1     2013-12-06 22:33:48
tt0049408   -1     2013-12-06 22:33:50
tt0033467   -1     2013-12-06 22:33:51
tt0000417   -1     2013-12-06 22:33:54
tt1437358   -1     2013-12-06 22:33:56
tt0063105   -1     2013-12-06 22:33:57
tt0253474   -1     2013-12-06 22:33:58
tt0034248    1     2013-12-06 22:34:00
tt0018578    1     2013-12-06 22:34:04
tt0335345   -1     2013-12-06 22:34:07
tt0028445    1     2013-12-06 22:34:10
tt0040536    1     2013-12-06 22:34:16
tt0332379   -1     2013-12-06 22:34:19
tt1728196   -1     2013-12-06 22:34:21

The next step in this research is to adapt the system so that it can be widely used for other types of content such as articles, news, and music, and to obtain better accuracy. One limitation of our approach is that if a cluster contains a lot of positive votes as well as a lot of negative ones, we will not recommend movies in this cluster. However, this situation may arise for two totally different reasons. One is that the user does not care about the movies in this cluster; in this case, making no recommendation from this cluster is a wise choice. The other is that the user really likes the movies in this cluster and has watched a lot of them, liking some and disliking others; ignoring the cluster in this situation will dissatisfy the user. To overcome this limitation, we can give weights to the votes from users. If a movie is voted positive by the user and is voted positive by many other people as well, the weight of this vote is decreased; if the movie is voted positive by the user but negative by most others, the vote reveals the special taste of this user and gains a higher weight, and vice versa. More specifically, we can set the weight of a positive vote to $N/(P + N)$ and the weight of a negative vote to $P/(P + N)$, where $P$ is the total number of positive votes from all users for this movie and $N$ is the total number of negative votes from all users for this movie. In this way, we can still ignore the cluster in the first situation while overcoming the limitation in the second situation.
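The proposed vote weighting can be sketched in a few lines. This is only an illustration of the two weight formulas above; the function name `vote_weight` and the handling of unseen votes and of movies without any votes are assumptions, since the text does not specify them.

```python
def vote_weight(user_vote, total_pos, total_neg):
    """Weight of one user's vote on a movie that has P positive and N negative votes overall."""
    total = total_pos + total_neg
    if user_vote == 0 or total == 0:
        # Assumption: unseen votes and movies nobody has voted on carry no weight.
        return 0.0
    if user_vote > 0:
        return total_neg / total  # N / (P + N): a like on a crowd favourite counts little
    return total_pos / total      # P / (P + N): a dislike on a crowd favourite counts a lot

# Hypothetical movie with P = 90 positive and N = 10 negative votes from all users.
print(vote_weight(1, 90, 10))   # weight of a like
print(vote_weight(-1, 90, 10))  # weight of a dislike
```

A vote that agrees with the majority therefore says little about the user's particular taste, while a vote against the majority gets a weight close to 1.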

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors especially thank Dr. Khaled Rasheed for providing them with suggestions and information regarding their research. The authors also thank Mr. Hongfei Yan for the related work and discussions.

References

[1] M. A. Ghazanfar and A. Prugel-Bennett, "An improved switching hybrid recommender system using Naive Bayes classifier and collaborative filtering," in Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS '10), pp. 493–502, March 2010.

[2] A. Bellogín, I. Cantador, P. Castells, and A. Ortigosa, "Discovering relevant preferences in a personalised recommender system using machine learning techniques," in Proceedings of the ECML-PKDD Workshop on Preference Learning, 2008.

[3] F.-C. Lin, H.-W. Yu, C.-H. Hsu, and T.-C. Weng, "Recommendation system for localized products in vending machines," Expert Systems with Applications, vol. 38, no. 8, pp. 9129–9138, 2011.

[4] K.-J. Kim and H. Ahn, "A recommender system using GA K-means clustering in an online shopping market," Expert Systems with Applications, vol. 34, no. 2, pp. 1200–1209, 2008.

[5] S. B. Aher and L. M. R. J. Lobo, "Combination of machine learning algorithms for recommendation of courses in E-Learning System based on historical data," Knowledge-Based Systems, vol. 51, pp. 1–14, 2013.

[6] G. M. Dakhel and M. Mahdavi, "Providing an effective collaborative filtering algorithm based on distance measures and neighbors' voting," International Journal of Computer Information Systems and Industrial Management Applications, vol. 5, pp. 524–531, 2013.

[7] W. H. Ujwala, V. R. Sheetal, and M. Debajyoti, "A hybrid web recommendation system based on the improved association rule mining algorithm," Journal of Software Engineering and Applications, vol. 6, article 396, 2013.

[8] I. Tatli and A. Birturk, "A tag-based hybrid music recommendation system using semantic relations and multi-domain information," in Proceedings of the 11th IEEE International Conference on Data Mining Workshops (ICDMW '11), pp. 548–554, December 2011.

[9] Q. Yang, J. Fan, J. Wang, and L. Zhou, "Personalizing web page recommendation via collaborative filtering and topic-aware Markov model," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM '10), pp. 1145–1150, December 2010.

[10] X. Cai, M. Bain, A. Krzywicki et al., "Collaborative filtering for people to people recommendation in social networks," in AI 2010: Advances in Artificial Intelligence, vol. 6464, pp. 476–485, Springer, Berlin, Germany, 2010.

[11] C. Biancalana, F. Gasparetti, A. Micarelli, A. Miola, and G. Sansonetti, "Context-aware movie recommendation based on signal processing and machine learning," in Proceedings of the 2nd Challenge on Context-Aware Movie Recommendation, pp. 5–10, October 2011.

[12] P. Jaccard, "The distribution of the flora in the alpine zone. 1," The New Phytologist, vol. 11, no. 2, pp. 37–50, 1912.

[13] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[14] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD '96), vol. 96, pp. 226–231, 1996.

[15] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Upper Saddle River, NJ, USA, 1988.

[16] D. Pomerantz and G. Dudek, "Context dependent movie recommendations using a hierarchical Bayesian model," in Advances in Artificial Intelligence, vol. 5549 of Lecture Notes in Computer Science, pp. 98–109, Springer, Berlin, Germany, 2009.

[17] C. Basu, H. Hirsh, and W. Cohen, "Recommendation as classification: using social and content-based information in recommendation," in Proceedings of the 15th National Conference on Artificial Intelligence (AAAI '98), pp. 714–720, July 1998.

[18] "2.3. Clustering - scikit-learn 0.14 documentation," http://scikit-learn.org/stable/modules/clustering.html#clustering.


Figure 1: System overview (distance computation system, preclustering system, and real time online recommendation system).

DBSCAN (density-based clustering), affinity propagation, hierarchical clustering, and random clustering as a baseline.

We have tested this proposal with all the clustering techniques on a university social website owned by one of our authors. The website has a subsystem that provides a movie service. We added a recommendation function in a banner and put it on top of the section to let users vote. We offered them the clustering results from all four clustering algorithms shuffled together and recorded the voting results separately.

The rest of the paper is organized as follows. Section 2 gives an overview of related works in which machine learning techniques have been applied to recommendation systems. Section 3 is the system overview of our recommendation system. Section 4 describes the distance functions we use to compute the distance matrix. Section 5 introduces all the clustering techniques used in our preclustering system. Section 6 includes the experimental setup and evaluation results. Finally, Section 7 concludes with discussions and future research.

2. Related Work

Machine learning techniques are very useful when huge amounts of data have to be classified and analyzed, which nowadays is a very common situation in many scenarios, especially in recommendation systems. Many different machine learning techniques are used in different recommendation systems, such as Naive Bayes classification [1], decision trees [2], k-means clustering and its improvements [3–6], and so forth.

Generally, recommendation systems are divided into two major categories: collaborative recommendation systems and content based recommendation systems [7]. Collaborative recommendation systems try to seek out users who share the same tastes as the given user and recommend movies according to the liking of these sought users. For example, Tatli and Birturk described an approach for creating music recommendations based on user-supplied tags that are augmented with a hierarchical structure extracted for top level genres from DBpedia. In this structure, each genre is represented by its stylistic origins, typical instruments, derivative forms, subgenres, and fusion genres. They use this well-organized structure in dimensionality reduction in user profiling [8]. Yang et al. proposed a personalized web page recommendation model called PIGEON (abbreviation for PersonalIzed web paGe rEcommendatiON) via collaborative filtering and a topic-aware Markov model and proposed a graph-based iteration algorithm to discover users' topics of interest, based on which user similarities are measured [9]. Cai et al. also proposed a model that fully captures the bilateral role of user interactions within a social network and formulated collaborative filtering methods to enable people to people recommendation [10]. As mentioned in Section 1, making recommendations based on users does not work properly in some particular settings, such as a community network system or a university network.

Content based recommendation systems try to recommend content similar to the items, such as web sites, that the user has liked. Biancalana et al. proposed two different context-aware approaches for the movie recommendation task: one is a hybrid recommender that assesses available contextual factors related to time in order to increase the performance of traditional CF approaches, and the other aims at identifying users in a household that submitted a given rating [11]. This latter approach is based on machine learning techniques, namely, neural networks and majority voting. In our paper, we focus on content based recommendation and use majority voting and clustering techniques. We obtained better results compared to their research, as will be shown in Section 6.

Some researchers have proposed hybrid recommendation approaches that combine different approaches. Ghazanfar and Prugel-Bennett proposed a unique switching hybrid recommendation approach by combining a Naive Bayes classification approach with collaborative filtering [1]. Ujwala et al. presented and investigated an approach based on a weighted Association Rule Mining Algorithm and text mining [7]. Bellogín et al. implemented decision trees and attribute selection together to build a recommendation model that finds the most relevant preferences of user and system [2]. The idea of hybrid recommendation inspires us to combine different approaches to implement movie recommendation in future work.

3. System Overview

Figure 1 provides a high-level overview of our new movie recommendation approach. As shown in Figure 1, the new movie recommendation system comes in three parts: the distance computation system, the preclustering system, and the real time online recommendation system.

3.1. Distance Computation System. The distance computation system computes the distance between different movies based on different properties of the movies, including the movie's types, publish year, countries of publishing companies, languages, directors, casts, and duration time. We first compute the Jaccard distance based on the movies' types, countries of publishing companies, languages, directors, and casts and then the distances based on publish year and duration time using the distance functions defined in Sections 4.2.2 and 4.2.3. We then compute the overall distance between each pair of movies by summing these distances with the weights obtained from the survey described in Section 4.1.

Figure 2: A demo of voting (clusters $C_1$ to $C_4$ with positive and negative votes). The probability that recommended movies are in $C_1$ is 5/8.

3.2. Preclustering System. The preclustering system separates movies into different clusters before giving them to the online recommendation system. In this system, we tried five different clustering algorithms: affinity propagation, DBSCAN, hierarchical clustering, spectral clustering, and a random generator. Spectral clustering was too slow to produce a result in the required time. Therefore, we sent the remaining four clustering results to the recommendation system in two data formats: one returns the cluster label for a given movie's IMDB identifier, and the other returns all the movies' IMDB identifiers for a given cluster label. In this way, the online recommendation system in Section 3.3 can quickly get the information it needs.

3.3. Real Time Online Recommendation System. After getting the preclustering results, the website reads the history of users' votes on movies and uses majority voting to generate recommendations. In more detail, if a user $U_i$ likes movie $M_{j_1}$, which belongs to cluster $C_{k_1}$, $U_i$ gives one vote to $C_{k_1}$. In contrast, if $U_i$ dislikes movie $M_{j_2}$ in $C_{k_2}$, $U_i$ cancels a vote for $C_{k_2}$. If a cluster ends up with a negative vote count, the count is reset to 0. After $U_i$ has voted on all the movies he/she comments on, different clusters have different numbers of votes. The movie recommendation system recommends movies according to the votes $V(C_j)$ of the clusters $C_j$ as follows:

$$P(C_i) = \frac{V(C_i)}{\sum_{j=1}^{n} V(C_j)} \tag{1}$$

where $P(C_i)$ is the probability that a movie should be recommended from cluster $C_i$, $V(C_i)$ is the number of votes in cluster $C_i$, and $n$ is the number of clusters, as shown in Figure 2.
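The voting rule and formula (1) can be sketched as follows. This is a minimal illustration rather than the website's PHP implementation; the function names, the movie identifiers, and the example votes are assumptions, chosen so that one cluster ends up with probability 5/8.

```python
from collections import Counter

def cluster_votes(user_votes, movie_to_cluster):
    """Net votes per cluster: +1 for a like, -1 for a dislike, clamped at 0 at the end."""
    net = Counter()
    for movie, vote in user_votes:
        if vote != 0:  # unseen votes (0) do not affect any cluster
            net[movie_to_cluster[movie]] += vote
    return {cluster: max(count, 0) for cluster, count in net.items()}

def recommendation_probabilities(votes):
    """Formula (1): P(C_i) = V(C_i) / sum_j V(C_j)."""
    total = sum(votes.values())
    if total == 0:
        return {cluster: 0.0 for cluster in votes}
    return {cluster: count / total for cluster, count in votes.items()}

# Hypothetical user: 5 likes in C1, 3 likes in C2, 1 dislike in C3.
movie_to_cluster = {f"m{i}": "C1" for i in range(5)}
movie_to_cluster.update({f"m{i}": "C2" for i in range(5, 8)})
movie_to_cluster["m8"] = "C3"
user_votes = [(f"m{i}", 1) for i in range(8)] + [("m8", -1)]

votes = cluster_votes(user_votes, movie_to_cluster)
probs = recommendation_probabilities(votes)
print(probs)
```

Because only per-cluster counters are updated, a new vote changes the recommendation probabilities immediately, without recomputing the clustering.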

To make sure that, in our evaluation, a user does not vote dislike on a recommended movie merely because the movie itself is bad, we only recommend movies with an IMDB score higher than 7 from the clusters selected by the user's votes.

4. Distance Computation System

4.1. Weight Survey. Movies' properties create a very special space in which the dimensions carry completely different weights. For example, the type of a movie is of course more important than its duration. Therefore, we cannot use the default distance functions provided by the Scikit-learn [18] package that we use. Instead, either feature selection as in [2] or a survey is needed to determine the weights of the different dimensions. In research areas where human beings do not know which features are important, such as features related to carcinogenicity, feature selection is the usual way to decide the weights of different features. In the movie recommendation area, however, we can determine the weights with the survey shown in Figure 3, since people know why they like a movie. By carefully designing a survey and letting users vote for their top three factors, we obtained the result shown in Figure 4.

From the survey result shown in Figure 4, the weight of publish year is 0.0915, the weight of country is 0.147, the weight of type is 0.4167, the weight of language is 0.0545, the weight of director is 0.0896, the weight of casts is 0.1792, and the weight of duration is 0.0214. These weights are used in the overall distance computation in Section 4.3.

4.2. Distance Function

4.2.1. Jaccard Distance. For the countries, languages, casts, and types of a movie, the distance is determined by how many items two sets share and how many differ, so we use the Jaccard distance proposed in [12]. The Jaccard distance is a statistic for measuring dissimilarity between sample sets. Using the intersection and union of the sets, the distance for each movie feature mentioned above is obtained from formula (2):

$$J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} \tag{2}$$
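Formula (2) maps directly onto Python set operations. The sketch below is illustrative; the function name and the example genre sets are assumptions, and the convention that two empty sets have distance 0 is also an assumption, since the text does not cover that case.

```python
def jaccard_distance(a, b):
    """Jaccard distance of formula (2): 1 - |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two movies that both lack the feature are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical movie types: one shared item out of three distinct items.
print(jaccard_distance({"Drama", "Crime"}, {"Drama", "Thriller"}))
```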

4.2.2. Publish Year Distance. The distance for publish year between movies is a special case. To human intuition, the distance between movies published in the 1950s and the 1960s seems much smaller than that between movies published in 2012 and 2013. In other words, the distance of publish year is not a linear function. One reason for this phenomenon is that more and more movies have been published recently. The number of movies per year in our database is shown in Figure 5.

Figure 5 shows that the number of movies published each year approximately follows an exponential function. The trend line is

$$f(x) = 0.3584 \cdot \exp(0.0598\,(x - 1902)) \tag{3}$$

Figure 3: Survey in our community.

Figure 4: Survey of top three liked movie factors (pie chart: publishing year 9%, country 15%, type 42%, language 5%, director 9%, actor 18%, duration 2%).

According to formula (3), we propose a new distance function for publish year, $\mathrm{PD}(\mathrm{year}_i, \mathrm{year}_j)$:

$$\mathrm{PD}(\mathrm{year}_i, \mathrm{year}_j) = \frac{\int_{\min(\mathrm{year}_i, \mathrm{year}_j)}^{\max(\mathrm{year}_i, \mathrm{year}_j)} f(x)\,dx}{\int_{1902}^{2014} f(x)\,dx} \tag{4}$$
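Because the trend line in formula (3) is exponential, both integrals in formula (4) have a closed form, so the publish year distance needs no numerical integration. The sketch below uses the constants from formulas (3) and (4); the function names are assumptions for illustration.

```python
from math import exp

A, K, X0 = 0.3584, 0.0598, 1902   # constants of the trend line in formula (3)
YEAR_MIN, YEAR_MAX = 1902, 2014   # integration limits of the denominator in formula (4)

def trend_integral(lo, hi):
    """Closed-form integral of f(x) = A * exp(K * (x - X0)) from lo to hi."""
    return (A / K) * (exp(K * (hi - X0)) - exp(K * (lo - X0)))

def publish_year_distance(year_i, year_j):
    """Formula (4): trend mass between the two years, normalised by the mass of the full range."""
    lo, hi = min(year_i, year_j), max(year_i, year_j)
    return trend_integral(lo, hi) / trend_integral(YEAR_MIN, YEAR_MAX)

print(publish_year_distance(1950, 1960))
print(publish_year_distance(2012, 2013))
```

Under this function one recent year covers more of the trend mass than a whole decade in the 1950s, which matches the intuition described in the text.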

Figure 5: Statistic result of movies published in each year (horizontal axis: year, 1880 to 2020; vertical axis: count, 0 to 900; trend line $y = 0.3584\,e^{0.0598(x - 1902)}$).

4.2.3. Duration Distance. People are very sensitive to small differences in time: half an hour feels nearly twice as long as a quarter of an hour. Therefore, we use the linear function shown in formula (5) to compute the duration distance $\mathrm{DUD}(\mathrm{dur}_i, \mathrm{dur}_j)$:

$$\mathrm{DUD}(\mathrm{dur}_i, \mathrm{dur}_j) = \frac{\min(|\mathrm{dur}_i - \mathrm{dur}_j|,\ 200)}{200} \tag{5}$$

Here, in order to normalize the distance, we consider 200 minutes as the maximum duration of a movie.

Figure 6: Matrix computation parallel demo (the matrix is divided into areas (1) to (5)).

4.3. Overall Distance. We build an $n \times n$ distance matrix between the movies in our database based on formula (6):

$$\mathrm{OD}(m_i, m_j) = \sum_{k \in \text{movie features}} \mathrm{weight}_k \cdot \mathrm{distance}_k(i, j) \tag{6}$$
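Formula (6) can be sketched by combining the per-feature distances with the survey weights from Section 4.1. This is an illustrative sketch, not the authors' code: the dictionary keys, the helper names, and the two example movies are assumptions, while the weights and the distance formulas follow the text.

```python
from math import exp

# Survey weights from Section 4.1; the key names are hypothetical field names.
WEIGHTS = {"year": 0.0915, "country": 0.147, "type": 0.4167, "language": 0.0545,
           "director": 0.0896, "casts": 0.1792, "duration": 0.0214}

def jaccard(a, b):
    """Formula (2); two empty sets are treated as identical (an assumption)."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b) if a | b else 0.0

def year_distance(y1, y2, k=0.0598, x0=1902, y_max=2014):
    """Formula (4) in closed form; the factor A / K cancels in the ratio."""
    g = lambda y: exp(k * (y - x0))
    return abs(g(y1) - g(y2)) / (g(y_max) - g(x0))

def duration_distance(d1, d2, cap=200):
    """Formula (5): absolute difference in minutes, capped and normalised by 200."""
    return min(abs(d1 - d2), cap) / cap

FEATURE_DISTANCES = {"year": year_distance, "duration": duration_distance,
                     "country": jaccard, "type": jaccard, "language": jaccard,
                     "director": jaccard, "casts": jaccard}

def overall_distance(m1, m2):
    """Formula (6): weighted sum of the per-feature distances."""
    return sum(w * FEATURE_DISTANCES[k](m1[k], m2[k]) for k, w in WEIGHTS.items())

# Two hypothetical movies that differ as much as possible in every feature.
m1 = {"year": 1902, "country": {"US"}, "type": {"Drama"}, "language": {"en"},
      "director": {"A"}, "casts": {"X"}, "duration": 60}
m2 = {"year": 2014, "country": {"FR"}, "type": {"Comedy"}, "language": {"fr"},
      "director": {"B"}, "casts": {"Y"}, "duration": 300}
print(overall_distance(m1, m2))
```

Since every per-feature distance lies in [0, 1] and the weights sum to about 1, the overall distance also stays in roughly [0, 1].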

4.4. Parallel Distance Computation. Since our distance matrix is symmetric, we can use this property to compute the distances in parallel.

To do parallel computation, two conditions must be satisfied:

(1) The work assigned to different processes must be approximately equal.

(2) There is no interaction or data exchange between different working processes.

The distance matrix we need to build is a symmetric matrix in which all the elements on the main diagonal are equal to 0, formally defined as a matrix $D$ where $D = D^T$ and $D[i, i] = 0$:

$$D = \begin{bmatrix}
0 & \lambda_{0,1} & \lambda_{0,2} & \cdots & \lambda_{0,n-1} & \lambda_{0,n} \\
\lambda_{0,1} & 0 & \lambda_{1,2} & \cdots & \lambda_{1,n-1} & \lambda_{1,n} \\
\lambda_{0,2} & \lambda_{1,2} & 0 & \cdots & \lambda_{2,n-1} & \lambda_{2,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\lambda_{0,n-1} & \lambda_{1,n-1} & \lambda_{2,n-1} & \cdots & 0 & \lambda_{n-1,n} \\
\lambda_{0,n} & \lambda_{1,n} & \lambda_{2,n} & \cdots & \lambda_{n-1,n} & 0
\end{bmatrix} \tag{7}$$

A sketch of the distance matrix in formula (7) is shown in Figure 6. Since the distance matrix is symmetric, the data in area (1) and area (2) and those in area (4) and area (5) are correspondingly equal. Therefore, instead of computing all of them, we compute only the upper triangle of the matrix, that is, the data in areas (1) and (5). However, if we split the parallel tasks by rows of the matrix, the amount of work (number of elements) in each task is not equal (from $n - 1$ down to 1). To overcome this problem, the work needed in area (4) is moved into area (3) so that we can group $n - 1$ unequal tasks into $\lceil n/2 \rceil$ equal tasks. In this way, a process pool in which each process computes the distances of one task can be used to run these $\lceil n/2 \rceil$ tasks in parallel.

Figure 7: Affinity propagation demo [18].

More specifically, we first set the main diagonal to 0. There is no need to compute the main diagonal, since the distance between a movie and itself is 0. Then, for each row $i$, we combine the distance computations of $D[n-1-i, n-i], D[n-1-i, n-i+1], \ldots, D[n-1-i, n-2], D[n-1-i, n-1]$ and those of $D[i, i+1], D[i, i+2], \ldots, D[i, n-2], D[i, n-1]$ into one task. After the computation, we get an upper triangular matrix, that is, area (1) and area (4). We then fill the lower triangle with the upper triangle, that is, let $D = D + D^T$.
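The row pairing described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the matrix is a plain list of lists, and `map_fn` defaults to the built-in `map` so the example runs sequentially. A process pool's `map` method (for example `multiprocessing.Pool.map`) could presumably be passed instead, provided the distance function is picklable.

```python
from math import ceil

def paired_row_tasks(n):
    """Group the upper-triangle cells of an n x n matrix into ceil(n / 2) tasks.

    Task i holds the cells of row i (n - 1 - i cells) and of row n - 1 - i (i cells),
    so each task has n - 1 cells, except the unpaired middle row when n is odd.
    """
    tasks = []
    for i in range(ceil(n / 2)):
        rows = sorted({i, n - 1 - i})  # a set, so the middle row is not listed twice
        tasks.append([(r, c) for r in rows for c in range(r + 1, n)])
    return tasks

def run_task(args):
    """Compute all distances of one task; no data is shared between tasks."""
    cells, items, dist = args
    return [(r, c, dist(items[r], items[c])) for r, c in cells]

def distance_matrix(items, dist, map_fn=map):
    """Build the symmetric matrix with a zero diagonal from the paired-row tasks."""
    n = len(items)
    D = [[0.0] * n for _ in range(n)]
    jobs = [(cells, items, dist) for cells in paired_row_tasks(n)]
    for result in map_fn(run_task, jobs):
        for r, c, d in result:
            D[r][c] = d
            D[c][r] = d  # mirror into the lower triangle, equivalent to D = D + D^T
    return D

print([len(task) for task in paired_row_tasks(6)])
print(distance_matrix([1, 4, 9, 11], lambda a, b: abs(a - b)))
```

For an even number of movies every task has exactly n - 1 cells, which satisfies the equal-work condition; for an odd number, the last task contains only the middle row.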

5. Preclustering System

In this section, we use four clustering methods to find the best clustering method for the final movie recommendation system. The four clustering algorithms are affinity propagation [13], DBSCAN [14], hierarchical clustering [15], and random clustering as a baseline.

5.1. Clustering Method

5.1.1. Affinity Propagation. Affinity propagation, first proposed in [13], generates clusters by sending messages between pairs of samples until convergence or until changes fall below a threshold. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability of one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen and hence the final clustering is given, as demonstrated in Figure 7.

5.1.2. DBSCAN. The DBSCAN algorithm [14] views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be of any shape, as opposed to k-means, which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure), and a set of non-core samples that are close to a core sample (but are not themselves core samples), as demonstrated in Figure 8.

Figure 8: DBSCAN demo [18].
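The core-sample idea can be illustrated with a small pure-Python DBSCAN that works directly on a precomputed distance matrix. This is a minimal sketch, not the implementation used in the experiments, which relied on Scikit-learn; the function name `dbscan_precomputed`, the `eps` and `min_samples` values, and the one-dimensional toy data are assumptions for illustration.

```python
def dbscan_precomputed(D, eps, min_samples):
    """Label each sample with a cluster index, or -1 for noise, from a distance matrix D."""
    n = len(D)
    # A sample's neighbourhood includes the sample itself (D[i][i] == 0).
    neighbors = [[j for j in range(n) if D[i][j] <= eps] for i in range(n)]
    labels = [None] * n  # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border sample
            continue
        cluster += 1  # i is a core sample and starts a new cluster
        labels[i] = cluster
        stack = list(neighbors[i])
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border sample: close to a core sample, not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                stack.extend(neighbors[j])  # j is also a core sample, keep growing
    return labels

# Toy data: two dense groups on a line and one isolated point.
points = [0, 1, 2, 10, 11, 12, 50]
D = [[abs(a - b) for b in points] for a in points]
print(dbscan_precomputed(D, eps=1.5, min_samples=2))
```

Because the algorithm only needs pairwise distances, it can consume the overall distance matrix of Section 4.3 without any coordinates for the movies.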

5.1.3. Hierarchical Clustering. Hierarchical clustering [15] is a general family of clustering algorithms that build nested clusters by merging them successively. This hierarchy of clusters is represented as a tree. The root of the tree is the unique cluster that gathers all the samples, and the leaves are the clusters with only one sample.

The beauty of this clustering method is that we do not need to provide the number of clusters in advance. Instead, we can start with a large cut height and decrease the cut height when the user has voted on most movies in one cluster. In this way, we can provide a larger cluster to fit the user's interests, as shown in Figure 9.

5.1.4. Random Clustering. We first fix the number of clusters to 60, and then for each movie we randomly pick the cluster it belongs to by generating a random number between 1 and 60.

5.2. Preclustering System Output. For efficiency, we output the clustering labels in two formats. One is used to look up the cluster label by movie ID, and the other is used to look up all the movies' IDs in one cluster given the cluster label. By writing the output files in JSON, we can easily transfer data between the preclustering system in Python and the real time online recommendation system in PHP.
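The two output formats can be sketched with the standard `json` module. The function name, the cluster labels, and the choice of movies below are assumptions for illustration; the actual file layout used by the authors is not given in the text.

```python
import json
from collections import defaultdict

def build_lookups(labels):
    """labels maps an IMDB identifier to its cluster label; returns both lookup tables."""
    movie_to_cluster = dict(labels)
    cluster_to_movies = defaultdict(list)
    for imdb_id, label in labels.items():
        cluster_to_movies[label].append(imdb_id)
    return movie_to_cluster, dict(cluster_to_movies)

# Hypothetical cluster labels for three movies.
labels = {"tt0111161": 3, "tt0114369": 3, "tt0034248": 7}
movie_to_cluster, cluster_to_movies = build_lookups(labels)

movie_json = json.dumps(movie_to_cluster)      # format 1: movie ID -> cluster label
cluster_json = json.dumps(cluster_to_movies)   # format 2: cluster label -> movie IDs
print(movie_json)
print(cluster_json)
```

Note that JSON object keys are always strings, so the integer cluster labels in the second format become "3" and "7" when the consumer reads the file.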

6. Evaluation

Before the evaluation started, we created a survey and asked users to select the top three movie features they care about, as mentioned in Section 4.1. After one week of data collection, we obtained the result shown in Figure 4.

Table 1: Comparison of results between different clustering methods in the first evaluation method.

Method   True positive   False positive   Accuracy rate
AF       239             98               70.920%
DB       174             77               69.323%
HC       228             88               72.152%
RA       125             104              54.585%

Based on the result, we calculate the distance matrix and use it as the input to the different clustering algorithms. We implemented the four clustering techniques using the Python library Scikit-learn.

6.1. Experimental Setup. The evaluation was conducted on a university social website in China that is owned by one of our authors. We added the recommendation function in a banner and put it on top of the movie service to let users vote. An overview of the voting component of the website is shown in Figure 10.

There are three voting choices for each recommended movie: like, do not like, and unseen. Voting like scores 1, voting do not like scores -1, and voting unseen scores 0. Because of the limited time for data collection, we could not wait for users to watch the recommended movies. Therefore, we added the unseen choice for users to select if they have not seen a recommended movie, and we consider only the like and do not like votes when evaluating our approach.

6.2. Result. After two weeks of voting on the movie recommendations, we collected 168,424 votes in total. Of these, 132,360 votes were collected in 12 days and used for the recommendation majority voting described in Section 3.3. The remaining 6,064 votes were collected in the two days after the recommendation system was deployed. Since only the like and do not like votes are considered eligible, we generated the results in the following subsections.

6.2.1. First Evaluation Result. In the first evaluation, listed in Table 1, we calculated the true positives, false positives, and accuracy for each clustering technique. True positive equals the number of like votes, and false positive equals the number of do not like votes. Because most of the recommended movies were voted unseen, a total of 1,133 votes were counted, and 216 users participated in the voting. Accuracy is defined as

$$\text{Accuracy} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{8}$$
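Formula (8) can be checked against the counts reported in Table 1. The snippet below is a small worked example; the function and variable names are assumptions, while the counts are taken from Table 1.

```python
def accuracy(true_positive, false_positive):
    """Formula (8): share of like votes among the counted (like or do not like) votes."""
    return true_positive / (true_positive + false_positive)

# (true positive, false positive) counts from Table 1.
table1 = {"AF": (239, 98), "DB": (174, 77), "HC": (228, 88), "RA": (125, 104)}

rates = {name: round(100 * accuracy(tp, fp), 3) for name, (tp, fp) in table1.items()}
total_votes = sum(tp + fp for tp, fp in table1.values())
print(rates)
print(total_votes)
```

The recomputed rates agree with the accuracy column of Table 1, and the counts add up to the 1,133 votes mentioned above.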

As shown by the result in Table 1, the hierarchical clustering approach is the best. However, the accuracy of every clustering approach is unsatisfactory. Therefore, we extracted the users' voting history from our server database and analyzed the data. We found that some of the users made fewer than 5 votes and some made between 5 and 10 votes. A few users made more than 20 votes. Among those who made more than 20 votes, we found that 4 users voted do not like on every movie. As a result, almost one third of the false positive votes were made by only 4 of the 216 users. Table 3 shows an example from one typical user. The first column is the movie ID from IMDB, the second column is the voting result, and the third column is the voting time. We think this user may dislike our recommendation idea and therefore made all the votes negative.

Figure 9: Hierarchical clustering demo (dendrogram; vertical axis: height).

Figure 10: Website overview.

Another small portion of users made more than 20 votes, most of them negative, with only a few positive votes and no zero (unseen) votes. Table 4 shows one such example. This is probably because the user voted negative instead of unseen for the movies he/she had not seen, since most users voted unseen more often than like or do not like.

6.2.2. Second Evaluation Result. The two situations mentioned in Section 6.2.1 strongly affected our result in the first evaluation. Therefore, we made a second evaluation by calculating each user's accuracy with formula (8) and then computing the average of all users' accuracies:

$$\text{Accuracy} = \frac{\sum \text{accuracy}}{\text{Total Number of Users}} \tag{9}$$

Table 2: Comparison of results between different clustering methods in the second evaluation method.

Method   Accuracy rate
AF       80.95%
DB       84.71%
HC       82.03%
RA       63.51%

In the second evaluation, we waited another 6 hours to collect more data. There were 7,100 votes in total, of which 1,420 were considered, and 242 users participated in voting. The second evaluation gave a much better result, as shown in Table 2: DBSCAN (density-based clustering), affinity propagation, and hierarchical clustering all obtained over 80% accuracy, and DBSCAN reached 84.71%.
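The difference between the two evaluation methods can be illustrated with a minimal sketch. It assumes votes are available as (user, score) pairs with score 1, -1, or 0; the function names and the toy data are hypothetical and are not taken from the deployed system.

```python
from collections import defaultdict


def pooled_accuracy(votes):
    """Formula (8) applied to all votes at once (first evaluation)."""
    likes = sum(1 for _, score in votes if score == 1)
    dislikes = sum(1 for _, score in votes if score == -1)
    return likes / (likes + dislikes) if likes + dislikes else 0.0


def per_user_accuracy(votes):
    """Formula (9): average of each user's own formula (8) accuracy.

    Assumption: users with no like/dislike votes are left out of the average.
    """
    counts = defaultdict(lambda: [0, 0])  # user -> [likes, dislikes]
    for user, score in votes:
        if score == 1:
            counts[user][0] += 1
        elif score == -1:
            counts[user][1] += 1
    per_user = [likes / (likes + dislikes) for likes, dislikes in counts.values()]
    return sum(per_user) / len(per_user) if per_user else 0.0


# Toy data: one user who votes everything negative dominates the pooled score.
votes = ([("a", 1)] * 3 + [("a", -1)]
         + [("b", 1)] * 2
         + [("c", -1)] * 20)
pooled = pooled_accuracy(votes)      # 5 likes out of 26 counted votes
averaged = per_user_accuracy(votes)  # mean of 0.75, 1.0, and 0.0
```

In this toy case a single all-negative voter pulls the pooled accuracy far below the per-user average, which is the effect described for the users in Tables 3 and 4.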

6.3. Comparison with Related Work. In this section, we compare our result with those of several other movie recommendation approaches and systems. Pomerantz and Dudek [16] used a hierarchical Bayesian approach combined with the item similarity of the content for the movie recommender and achieved 70.8% as the best classifier result. In Biancalana et al.'s paper [11], the best result is 82.4%, obtained by combining three classifiers in a neural network. Basu et al. [17] presented an inductive learning approach to recommendation and achieved an averaged precision of 83%.

Table 3: Example of all negative votes.

imdb        Vote   Last update
tt1716777   -1     2013-12-06 23:36:55
tt0183649   -1     2013-12-06 23:36:56
tt1728196   -1     2013-12-06 23:36:58
tt0014429   -1     2013-12-06 23:36:59
tt0034248   -1     2013-12-06 23:37:01
tt0017136   -1     2013-12-06 23:37:02
tt0035169   -1     2013-12-06 23:37:03
tt0023940   -1     2013-12-06 23:37:04
tt0035209   -1     2013-12-06 23:37:05
tt0047876   -1     2013-12-06 23:37:23
tt0106535   -1     2013-12-06 23:37:24
tt0121164   -1     2013-12-06 23:37:27
tt0006864   -1     2013-12-06 23:37:28
tt0499141   -1     2013-12-06 23:37:30
tt0049010   -1     2013-12-06 23:37:32
tt1298650   -1     2013-12-06 23:40:39
tt0408236   -1     2013-12-06 23:40:40
tt0162661   -1     2013-12-06 23:40:41
tt0349903   -1     2013-12-06 23:40:43
tt0114369   -1     2013-12-06 23:40:44
tt0096895   -1     2013-12-06 23:40:45
tt0111161   -1     2013-12-06 23:40:49
tt2992146   -1     2013-12-06 23:40:53
tt0082971   -1     2013-12-06 23:40:56
tt1075419   -1     2013-12-06 23:40:58

Our best result (84.71%), achieved by DBSCAN [14] as shown in Table 2, is an improvement of 1.71 percentage points over the best result of the related works mentioned above. Meanwhile, all these related works compute their models offline and use the stored models when testing; these models cannot be updated in time because redoing the learning process takes too long. Our approach, on the other hand, does not need to store user models, and it updates the recommendation result in real time by preclustering movies and using a very simple learning approach (majority voting) to provide movie recommendations. Since movies are updated much less frequently than users' votes, our approach is much more time efficient than the related works. To sum up, our approach achieves real-time model updating without compromising accuracy.

7. Conclusion and Future Work

In this paper, we proposed a new distance function for publish year (year related) to compute the distance matrix and then implemented a real-time, content-based (movie-based) movie recommendation system using preclustering and majority voting. We implemented four different clustering techniques in the preclustering process and obtained 84.71% accuracy as the best result, which is better than some of the approaches in other research papers. We realized a real-time updating model with no compromise on accuracy. Our approach can work even better with special user groups, such as people in colleges, universities, or professional communities.

Table 4: Example of nonzero votes.

imdb        Vote   Last update
tt0068767   -1     2013-12-06 22:33:30
tt0109958   -1     2013-12-06 22:33:32
tt0047034   -1     2013-12-06 22:33:33
tt0397892   -1     2013-12-06 22:33:35
tt0032551   -1     2013-12-06 22:33:36
tt0110008    1     2013-12-06 22:33:38
tt0416881   -1     2013-12-06 22:33:40
tt1233499   -1     2013-12-06 22:33:42
tt1754691   -1     2013-12-06 22:33:45
tt1034303   -1     2013-12-06 22:33:47
tt0015324   -1     2013-12-06 22:33:48
tt0049408   -1     2013-12-06 22:33:50
tt0033467   -1     2013-12-06 22:33:51
tt0000417   -1     2013-12-06 22:33:54
tt1437358   -1     2013-12-06 22:33:56
tt0063105   -1     2013-12-06 22:33:57
tt0253474   -1     2013-12-06 22:33:58
tt0034248    1     2013-12-06 22:34:00
tt0018578    1     2013-12-06 22:34:04
tt0335345   -1     2013-12-06 22:34:07
tt0028445    1     2013-12-06 22:34:10
tt0040536    1     2013-12-06 22:34:16
tt0332379   -1     2013-12-06 22:34:19
tt1728196   -1     2013-12-06 22:34:21

The next step in this research is to make the system adaptable so that it can be widely used for many other types of content, such as articles, news, and music, and to obtain better accuracy. One limitation of our approach is that if a cluster contains many positive votes as well as many negative ones, we will not recommend movies in this cluster. However, this situation may arise for two totally different reasons. One is that the user does not care about the movies in this cluster; in this case, making no recommendation from this cluster is a wise choice. The other is that the user really likes the movies in this cluster and has watched many of them, liking some and disliking others; ignoring the cluster in this situation will dissatisfy the user. To overcome this limitation, we can give weights to the votes from users. If a movie is voted positive by the user and is also voted positive by many other people, the weight of that vote is decreased; if the movie is voted positive by the user but negative by most others, the vote reveals the special taste of this user and gains a higher weight, and vice versa. More specifically, we can set the weight of a positive vote to $N/(P+N)$ and the weight of a negative vote to $P/(P+N)$, where $P$ is the total number of positive votes from all users for this movie and $N$ is the total number of negative votes from all users for this movie. In this way, we can still ignore the cluster in the first situation while overcoming the limitation in the second situation.
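The proposed weighting can be sketched in a few lines. This is only an illustration of the $N/(P+N)$ and $P/(P+N)$ weights described above; the function name and the fallback for a movie with no votes yet are assumptions, since the paper describes this as future work and gives no implementation.

```python
def vote_weight(score, positives, negatives):
    """Weight of one user's vote on a movie.

    score: 1 for like, -1 for do not like, 0 for unseen.
    positives (P), negatives (N): vote totals from all users for this movie.
    """
    total = positives + negatives
    if score == 0:
        return 0.0  # unseen votes carry no weight
    if total == 0:
        return 1.0  # assumption: no community votes yet, keep the plain vote
    if score > 0:
        return negatives / total  # N / (P + N): liking an unpopular movie counts more
    return positives / total      # P / (P + N): disliking a popular movie counts more


# A like on a movie that 90 of 100 voters also liked is down-weighted,
# while a dislike on the same movie is up-weighted.
w_like = vote_weight(1, positives=90, negatives=10)
w_dislike = vote_weight(-1, positives=90, negatives=10)
```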

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors especially thank Dr. Khaled Rasheed for providing them with suggestions and information regarding their research. The authors also thank Mr. Hongfei Yan for the related work and discussions.

References

[1] M. A. Ghazanfar and A. Prugel-Bennett, "An improved switching hybrid recommender system using Naive Bayes classifier and collaborative filtering," in Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS '10), pp. 493–502, March 2010.

[2] A. Bellogín, I. Cantador, P. Castells, and A. Ortigosa, "Discovering relevant preferences in a personalised recommender system using machine learning techniques," in Proceedings of the ECML-PKDD Workshop on Preference Learning, 2008.

[3] F.-C. Lin, H.-W. Yu, C.-H. Hsu, and T.-C. Weng, "Recommendation system for localized products in vending machines," Expert Systems with Applications, vol. 38, no. 8, pp. 9129–9138, 2011.

[4] K.-J. Kim and H. Ahn, "A recommender system using GA K-means clustering in an online shopping market," Expert Systems with Applications, vol. 34, no. 2, pp. 1200–1209, 2008.

[5] S. B. Aher and L. M. R. J. Lobo, "Combination of machine learning algorithms for recommendation of courses in E-Learning System based on historical data," Knowledge-Based Systems, vol. 51, pp. 1–14, 2013.

[6] G. M. Dakhel and M. Mahdavi, "Providing an effective collaborative filtering algorithm based on distance measures and neighbors' voting," International Journal of Computer Information Systems and Industrial Management Applications, vol. 5, pp. 524–531, 2013.

[7] W. H. Ujwala, V. R. Sheetal, and M. Debajyoti, "A hybrid web recommendation system based on the improved association rule mining algorithm," Journal of Software Engineering and Applications, vol. 6, article 396, 2013.

[8] I. Tatli and A. Birturk, "A tag-based hybrid music recommendation system using semantic relations and multi-domain information," in Proceedings of the 11th IEEE International Conference on Data Mining Workshops (ICDMW '11), pp. 548–554, December 2011.

[9] Q. Yang, J. Fan, J. Wang, and L. Zhou, "Personalizing web page recommendation via collaborative filtering and topic-aware Markov model," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM '10), pp. 1145–1150, December 2010.

[10] X. Cai, M. Bain, A. Krzywicki et al., "Collaborative filtering for people to people recommendation in social networks," in AI 2010: Advances in Artificial Intelligence, vol. 6464, pp. 476–485, Springer, Berlin, Germany, 2010.

[11] C. Biancalana, F. Gasparetti, A. Micarelli, A. Miola, and G. Sansonetti, "Context-aware movie recommendation based on signal processing and machine learning," in Proceedings of the 2nd Challenge on Context-Aware Movie Recommendation, pp. 5–10, October 2011.

[12] P. Jaccard, "The distribution of the flora in the alpine zone. 1," The New Phytologist, vol. 11, no. 2, pp. 37–50, 1912.

[13] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[14] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD '96), vol. 96, pp. 226–231, 1996.

[15] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Upper Saddle River, NJ, USA, 1988.

[16] D. Pomerantz and G. Dudek, "Context dependent movie recommendations using a hierarchical Bayesian model," in Advances in Artificial Intelligence, vol. 5549 of Lecture Notes in Computer Science, pp. 98–109, Springer, Berlin, Germany, 2009.

[17] C. Basu, H. Hirsh, and W. Cohen, "Recommendation as classification: using social and content-based information in recommendation," in Proceedings of the 15th National Conference on Artificial Intelligence (AAAI '98), pp. 714–720, July 1998.

[18] "2.3. Clustering - scikit-learn 0.14 documentation," http://scikit-learn.org/stable/modules/clustering.html#clustering.



Figure 2: A demo of voting over clusters $C_1$ to $C_4$. The probability that recommended movies are in $C_1$ is 5/8.

on different properties of the movies, including the movie's types, publish year, countries of publishing companies, languages, directors, casts, and duration time. We first compute the Jaccard distance based on movie types, countries of publishing companies, languages, directors, and casts, and then the distance based on publish year and duration time using the distances we define in Sections 4.2.2 and 4.2.3. We then compute the overall distance between each pair of movie samples by summing these distances with the weights obtained from the survey mentioned in Section 4.1.

3.2. Preclustering System. The preclustering system separates movies into different clusters before passing them to the online recommendation system. In this system, we tried five different clustering algorithms: affinity propagation, DBSCAN, hierarchical clustering, spectral clustering, and a random generator. Among these, spectral clustering was too slow to produce a result in the required time. Therefore, we sent the remaining four clustering results to the recommendation system in two data formats: one returns the cluster label for a given movie's IMDB identifier, and the other returns all the movies' IMDB identifiers for a given cluster label. In this way, the online recommendation system in Section 3.3 can quickly get the information it needs.

3.3. Real Time Online Recommendation System. After getting the preclustering results, the website reads the history of users' votes on movies and uses majority voting to generate recommendations. In more detail, if a user $U_i$ likes movie $M_{j_1}$, which belongs to cluster $C_{k_1}$, $U_i$ gives one vote to $C_{k_1}$. On the contrary, if $U_i$ dislikes movie $M_{j_2}$ in $C_{k_2}$, $U_i$ cancels one vote for $C_{k_2}$. If a cluster ends up with a negative vote count, the count is reset to 0. After $U_i$ has voted on all the movies he/she comments on, different clusters have different vote counts. The movie recommendation system recommends movies according to the vote count $V(C_j)$ of each cluster $C_j$, as follows:

$$P(C_i) = \frac{V(C_i)}{\sum_{j=1}^{n} V(C_j)} \tag{1}$$

where $P(C_i)$ is the probability that a movie should be recommended from cluster $C_i$, $V(C_i)$ is the number of votes in cluster $C_i$, and $n$ is the number of clusters. An example is shown in Figure 2.

To make sure that, in our evaluation, a user does not vote a recommended movie as dislike merely because the movie itself is bad, we only recommend movies with an IMDB score higher than 7 from the clusters that the user's votes have selected.
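The majority voting in formula (1) can be sketched as follows. This is a minimal illustration, not the website's PHP code; the function name, the data layout, and the toy votes (chosen so that $C_1$ ends at 5/8, as in Figure 2) are assumptions.

```python
from collections import defaultdict


def cluster_probabilities(votes, movie_to_cluster):
    """Turn one user's vote history into P(C_i) as in formula (1).

    votes: iterable of (movie_id, score), score 1 (like), -1 (dislike), 0 (unseen).
    movie_to_cluster: dict mapping movie_id -> cluster label.
    """
    tally = defaultdict(int)
    for movie_id, score in votes:
        if score == 0:
            continue  # unseen votes neither add nor cancel a vote
        tally[movie_to_cluster[movie_id]] += score  # like adds, dislike cancels

    # A cluster whose final count is negative is reset to 0.
    tally = {cluster: max(count, 0) for cluster, count in tally.items()}
    total = sum(tally.values())
    if total == 0:
        return {cluster: 0.0 for cluster in tally}
    return {cluster: count / total for cluster, count in tally.items()}


# Hypothetical history: 5 likes in C1, 3 likes in C2,
# one like and one dislike in C3, one dislike in C4.
movie_to_cluster = {f"m{i}": "C1" for i in range(1, 6)}
movie_to_cluster.update({"m6": "C2", "m7": "C2", "m8": "C2",
                         "m9": "C3", "m10": "C3", "m11": "C4"})
votes = [(f"m{i}", 1) for i in range(1, 10)] + [("m10", -1), ("m11", -1)]
probs = cluster_probabilities(votes, movie_to_cluster)
```

With these toy votes the net total is 8, so `probs["C1"]` is 5/8 and `probs["C2"]` is 3/8, while C3 and C4 receive no recommendations.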

4. Distance Computation System

4.1. Weight Survey. Movies' properties create a very special space in which each dimension carries a completely different weight. For example, the type of a movie is of course more important than its duration. Therefore, we cannot use the default distance function provided by the Scikit-learn [18] package that we use. Instead, the weights of the different dimensions should be determined either by feature selection, as in [2], or by a survey. In research areas where humans do not know which features are important, such as carcinogenicity, feature selection is the usual way to decide the weights of different features. In the movie recommendation area, however, we can investigate the weights with the survey shown in Figure 3, since people know why they like a movie. By carefully designing a survey and letting users vote for their top three factors, we obtained the result shown in Figure 4.

From the survey result shown in Figure 4, we obtain the following weights: publish year, 0.0915; country, 0.147; type, 0.4167; language, 0.0545; director, 0.0896; casts, 0.1792; and duration, 0.0214. These weights are used in the overall distance computation in Section 4.3.

4.2. Distance Function

4.2.1. Jaccard Distance. For the countries, languages, casts, and types of a movie, the distance is determined by how many items two sets share and how many differ, so we decided to use the Jaccard distance proposed in [12]. The Jaccard distance is a statistic used for measuring dissimilarity between sample sets, as shown in formula (2). According to this formula, we use intersection and union to easily obtain the distance between movies for each of the features mentioned above:

$$J(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|} \tag{2}$$
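Formula (2) maps directly onto set operations. The following is a minimal sketch; the function name and the handling of two empty sets are assumptions, and the genre sets are made up for illustration.

```python
def jaccard_distance(a, b):
    """Jaccard distance J(A, B) = 1 - |A ∩ B| / |A ∪ B| from formula (2)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty feature sets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Two movies sharing one of three distinct types are at distance 1 - 1/3.
d_types = jaccard_distance({"Drama", "Crime"}, {"Drama", "Thriller"})
```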

4.2.2. Publish Year Distance. The distance between movies' publish years is a very special case for human perception. Intuitively, the distance between movies published in the 1950s and the 1960s seems much smaller than that between movies published in 2012 and 2013. In other words, the distance of publish year is not a linear function. One reason for this phenomenon is that more and more movies have been published recently. The number of movies per year in our database is shown in Figure 5.

Figure 5 shows that the number of movies published each year grows approximately exponentially. The trend line is

$$f(x) = 0.3584 \cdot e^{0.0598 (x - 1902)} \tag{3}$$

Figure 3: Survey in our community.

Figure 4: Survey of top three liked movie factors (publishing year 9%, country 15%, type 42%, language 5%, director 9%, actor 18%, duration 2%).

According to formula (3), we propose a new distance function for publish year, $\mathrm{PD}(\text{year}_i, \text{year}_j)$, as follows:

$$\mathrm{PD}(\text{year}_i, \text{year}_j) = \frac{\int_{\min(\text{year}_i, \text{year}_j)}^{\max(\text{year}_i, \text{year}_j)} f(x)\,dx}{\int_{1902}^{2014} f(x)\,dx} \tag{4}$$
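Because $f(x)$ in formula (3) is a plain exponential, the integrals in formula (4) have a closed form. The sketch below uses that closed form instead of numerical integration; the constant names and the helper function are assumptions made for illustration.

```python
import math

A, B, X0 = 0.3584, 0.0598, 1902      # trend-line constants from formula (3)
FIRST_YEAR, LAST_YEAR = 1902, 2014   # normalisation bounds from formula (4)


def _antiderivative(x):
    # Integral of A * exp(B * (x - X0)) dx, up to an additive constant.
    return (A / B) * math.exp(B * (x - X0))


def publish_year_distance(year_i, year_j):
    """PD(year_i, year_j): share of all movies published between the two years."""
    lo, hi = min(year_i, year_j), max(year_i, year_j)
    numerator = _antiderivative(hi) - _antiderivative(lo)
    denominator = _antiderivative(LAST_YEAR) - _antiderivative(FIRST_YEAR)
    return numerator / denominator


old_gap = publish_year_distance(1950, 1960)     # ten years apart, few movies
recent_gap = publish_year_distance(2012, 2013)  # one year apart, many movies
```

Under this trend line the ten-year gap between 1950 and 1960 comes out smaller than the one-year gap between 2012 and 2013, which matches the intuition described in Section 4.2.2.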

Figure 5: Statistic result of movies published in each year (horizontal axis: year, 1880 to 2020; vertical axis: count, 0 to 900; trend line $y = 0.3584 e^{0.0598 (x - 1902)}$).

4.2.3. Duration Distance. Humans are very sensitive to small periods of time: half an hour feels nearly twice as long as a quarter of an hour. Therefore, we use the linear function shown in formula (5) to compute the duration distance $\mathrm{DUD}(\text{dur}_i, \text{dur}_j)$:

$$\mathrm{DUD}(\text{dur}_i, \text{dur}_j) = \frac{\min\left(\left|\text{dur}_i - \text{dur}_j\right|, 200\right)}{200} \tag{5}$$

Here, in order to normalize the distance, we consider 200 minutes to be the maximum duration of a movie.

Figure 6: Matrix computation parallel demo (areas (1) to (5)).

4.3. Overall Distance. We build a distance matrix of size $n \times n$ over all the movies in our database based on formula (6):

$$\mathrm{OD}(m_i, m_j) = \sum_{k \in \text{movie features}} \text{weight}_k \cdot \text{distance}_k(i, j) \tag{6}$$
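As an illustration of formulas (5) and (6), the sketch below combines per-feature distances with the survey weights from Section 4.1. The dictionary keys and function names are hypothetical; the per-feature distances are assumed to have been computed already and to lie in [0, 1].

```python
# Survey weights from Section 4.1 (the keys are illustrative names).
WEIGHTS = {
    "year": 0.0915, "country": 0.147, "type": 0.4167, "language": 0.0545,
    "director": 0.0896, "casts": 0.1792, "duration": 0.0214,
}


def duration_distance(dur_i, dur_j, cap=200):
    """Formula (5): absolute difference in minutes, capped and scaled to [0, 1]."""
    return min(abs(dur_i - dur_j), cap) / cap


def overall_distance(feature_distances, weights=WEIGHTS):
    """Formula (6): weighted sum of the per-feature distances of one movie pair."""
    return sum(weights[name] * feature_distances[name] for name in weights)


# Two movies that differ only in type and by 30 minutes of running time.
pair = {"year": 0.0, "country": 0.0, "type": 1.0, "language": 0.0,
        "director": 0.0, "casts": 0.0, "duration": duration_distance(90, 120)}
od = overall_distance(pair)
```

Because the weights sum to roughly 1, the overall distance stays in roughly the same [0, 1] range as the individual feature distances.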

4.4. Parallel Distance Computation. Since our distance matrix is symmetric, we can use this property to compute the distances in parallel.

To do parallel computation, two conditions must be satisfied:

(1) The work assigned to different processes must be approximately equal.

(2) There is no interaction or data exchange between different working processes.

The distance matrix we need to build is a symmetric matrix in which all the elements on the main diagonal are equal to 0, formally defined as a matrix $D$ where $D = D^T$ and $D[i, i] = 0$, as follows:

$$D = \begin{bmatrix}
0 & \lambda_{0,1} & \lambda_{0,2} & \cdots & \lambda_{0,n-1} & \lambda_{0,n} \\
\lambda_{0,1} & 0 & \lambda_{1,2} & \cdots & \lambda_{1,n-1} & \lambda_{1,n} \\
\lambda_{0,2} & \lambda_{1,2} & 0 & \cdots & \lambda_{2,n-1} & \lambda_{2,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\lambda_{0,n-1} & \lambda_{1,n-1} & \lambda_{2,n-1} & \cdots & 0 & \lambda_{n-1,n} \\
\lambda_{0,n} & \lambda_{1,n} & \lambda_{2,n} & \cdots & \lambda_{n-1,n} & 0
\end{bmatrix} \tag{7}$$

A sketch of the distance matrix in formula (7) is shown in Figure 6. Since the distance matrix is symmetric, the data in area (1) and area (2), and those in area (4) and area (5), are correspondingly equal. Therefore, instead of computing all of them, we just compute the upper triangle of the matrix, that is, the data in areas (1) and (5). However, if we separate the parallel tasks by rows of the matrix, the work (number of elements) in each task is not equal (it ranges from $n - 1$ down to 1). To overcome this problem, the work needed in area (4) is filled into area (3), so that we can group the $n - 1$ unequal tasks into $\lceil n/2 \rceil$ equal tasks. In this way, a process pool in which each process computes the distances in one task can be used to run these $\lceil n/2 \rceil$ tasks in parallel.

Figure 7: Affinity propagation demo [18].

More specifically, we first set the main diagonal to 0. There is no need to compute the main diagonal, since the distance between a movie and itself is 0. Then, for each row $i$, we combine the distance computations of $D[n-1-i, n-i], D[n-1-i, n-i+1], \ldots, D[n-1-i, n-2], D[n-1-i, n-1]$ and those of $D[i, i+1], D[i, i+2], \ldots, D[i, n-2], D[i, n-1]$ into one task. After the computation, we get an upper triangular matrix, that is, area (1) and area (4). We then fill the lower triangle from the upper triangle, that is, we let $D = D + D^T$.
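The row pairing described above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names and the `map_fn` hook are assumptions, and the tasks are run sequentially here with the built-in `map`.

```python
def balanced_row_tasks(n):
    """Pair row i with row n-1-i so that each task covers n-1 upper-triangle cells.

    Returns ceil(n/2) tasks; for odd n the middle row forms a smaller task on its own.
    """
    tasks = []
    for i in range((n + 1) // 2):
        rows = [i] if i == n - 1 - i else [i, n - 1 - i]
        tasks.append([(r, c) for r in rows for c in range(r + 1, n)])
    return tasks


def build_distance_matrix(items, distance, map_fn=map):
    """Fill a symmetric distance matrix task by task."""
    n = len(items)
    D = [[0.0] * n for _ in range(n)]  # the main diagonal stays 0

    def run(cells):
        return [(r, c, distance(items[r], items[c])) for r, c in cells]

    for result in map_fn(run, balanced_row_tasks(n)):
        for r, c, d in result:
            D[r][c] = d
            D[c][r] = d  # mirror the upper triangle, equivalent to D = D + D^T
    return D


task_sizes = [len(task) for task in balanced_row_tasks(6)]
D = build_distance_matrix([1, 4, 9, 16], lambda a, b: abs(a - b))
```

For n = 6 the three tasks each hold five cells. To run the tasks in a real process pool, the worker would have to be a picklable module-level function rather than the nested `run` used in this sketch.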

5. Preclustering System

In this section, we use four clustering methods to investigate which one works best in the final movie recommendation system. The four clustering algorithms are affinity propagation [13], DBSCAN [14], hierarchical clustering [15], and random clustering as a baseline.

5.1. Clustering Method

5.1.1. Affinity Propagation. Affinity propagation, first proposed in [13], generates clusters by sending messages between pairs of samples until convergence or until changes fall below a threshold. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability of one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen and hence the final clustering is given, as demonstrated in Figure 7.

5.1.2. DBSCAN. The DBSCAN algorithm [14] views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be of any shape, as opposed to k-means, which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure), and a set of non-core samples that are close to a core sample (but are not themselves core samples), as demonstrated in Figure 8.

Figure 8: DBSCAN demo [18].

5.1.3. Hierarchical Clustering. Hierarchical clustering [15] is a general family of clustering algorithms that build nested clusters by merging them successively. This hierarchy of clusters is represented as a tree. The root of the tree is the unique cluster that gathers all the samples, and the leaves are the clusters with only one sample.

The beauty of this clustering method is that we do not need to provide the number of clusters ahead of time. Instead, we can initialize with a large cut height and decrease the cut height when the user has voted on most movies in one cluster. In this way, we can provide a larger cluster to fit the user's interests, as shown in Figure 9.

5.1.4. Random Clustering. We first fix the number of clusters at 60, and then for each movie we randomly pick the cluster it belongs to by generating a random number between 1 and 60.

5.2. Preclustering System Output. For efficiency, we output the clustering labels in two formats. One is used to look up the cluster label by movie ID, and the other is used to look up all the movie IDs in one cluster given the cluster label. By outputting the files in JSON, we can easily transfer data between the preclustering system in Python and the real-time online recommendation system in PHP.
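The random baseline of Section 5.1.4 and the two output formats can be sketched with the standard library alone. The function names, the optional seed, and the exact JSON layout are assumptions; the paper only states that both lookups are exported as JSON.

```python
import json
import random


def random_clustering(movie_ids, n_clusters=60, seed=None):
    """Baseline: give each movie a random cluster label between 1 and n_clusters."""
    rng = random.Random(seed)
    return {movie_id: rng.randint(1, n_clusters) for movie_id in movie_ids}


def export_labels(movie_to_cluster):
    """Serialise the labels in both lookup directions.

    Returns (movie -> label JSON, label -> list of movies JSON).
    """
    cluster_to_movies = {}
    for movie_id, label in movie_to_cluster.items():
        # JSON object keys must be strings, so labels are stored as strings here.
        cluster_to_movies.setdefault(str(label), []).append(movie_id)
    return json.dumps(movie_to_cluster), json.dumps(cluster_to_movies)


labels = random_clustering(["tt0111161", "tt0068767", "tt0110008"], seed=0)
by_movie_json, by_cluster_json = export_labels(labels)
```

The same `export_labels` step would apply to labels produced by any of the other three clustering methods, since it only needs a movie-to-label mapping.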

6. Evaluation

Before the evaluation started, we created a survey and asked users to select the top three movie features they care about, as mentioned in Section 4.1. After one week of data collection, we obtained the result shown in Figure 4.

Based on this result, we calculated the distance matrix and used it as the input to the different clustering algorithms. We implemented the four clustering techniques using the Python library Scikit-learn.

Table 1: Comparison of results between different clustering methods in the first evaluation method.

Method   True positive   False positive   Accuracy rate
AF       239             98               70.920%
DB       174             77               69.323%
HC       228             88               72.152%
RA       125             104              54.585%

6.1. Experimental Setup. The evaluation was conducted on a university social website in China, which is owned by one of our authors. We added the recommendation function in a banner and put it on top of the movie service to let users vote. An overview of the voting component of the website is shown in Figure 10.

There are three voting choices for each recommended movie: like, do not like, and unseen. Voting like scores 1, voting do not like scores -1, and voting unseen scores 0. Because of the limited time for data collection, we could not wait for users to watch the recommended movies. Therefore, we added the unseen choice for users to select if they have not seen the recommended movie, and we only consider the like and do not like votes when evaluating our approach.

6.2. Result. After two weeks of voting on the movie recommendations, we had collected 168,424 votes in total. Of these, 132,360 votes were collected in 12 days and used for the recommendation majority voting mentioned in Section 3.3. The remaining 6,064 votes were collected in the two days after the recommendation system was deployed. Since only the like and do not like votes are considered eligible, we generated the results in the following subsections.

621 First Evaluation Result In the first evaluation resultlisted in Table 1 we calculated the true positive false positiveand accuracy for each clustering techniques True positiveequals the number of like votes and false positive equals thenumber of do not like votes Because most of the recom-mended movies are voted unseen there are a total of 1133votes counted and 216 users participated in the votingConsider the following

Accuracy = True PositiveTrue Positive + False Positive

(8)

As shown by the result in Table 1 the hierarchical clus-tering approach is the best However the accuracy of everyclustering approach is not satisfied Therefore we extractedthe usersrsquo voting history from our server database andanalyzed those data We found that some of the users madeless than 5 votes some of them made between 5 to 10 votes

Journal of Applied Mathematics 7

08

06

04

02

00

015

Hei

ght

X31

X37

X0 X20

X28

X38

X11

X35 X3 X29

X10

X36

X7 X24 X6 X3

0X2

5X5 X21

X19

X1 X13 X18

X14

X17

X15

X26 X12

X33 X8

X2X2

2X2

3 X32

X9 X27 X4 X34

X16

Figure 9 Hierarchical clustering demo

Beijing institute of technology

IMBD rating 71IMBD rating 71IMBD rating 71

IMBD rating 71 IMBD rating 85 IMBD rating 80

Isabella

Add

Voting

Bookmarkfind us on Weibo Blog Notifica-

tionsMessages Tasks Profiles logout

Forum Home Blog Search Support

like dislike unseen detailslike dislike unseen detailslike dislike unseen detailslike dislike unseen detailslike dislike unseen detailslike dislike unseen details

Figure 10 Website overview

There are a few users who made more than 20 votes In thosepeople who made more than 20 votes we found that 4 usersmade all their votes do not like Therefore almost one thirdof the false positive votes were made by only 4 users over 216users Table 3 shows one example from one typical user Thefirst column is the movie ID from imdb the second columnis the voting results and the third column is the voting timeWe think maybe this user dislikes our recommendation ideaso he made all the votes negative

There is another small portion of users made more than20 votes Most of the votes are negative Only a few votesare positive and no zero or unseen votes Table 4 shows oneof such examples It is probably because this user is votingnegative instead of the unseen choices for the unseenmoviessincemost of the users were voting unseenmore than like anddo not like

622 Second Evaluation Result These two situations men-tioned in Section 621 effect our result a lot in the first eval-uationTherefore wemade a second evaluation by calculatingeach userrsquos accuracy in formula (8) and then computing theaverage of all userrsquos accuracy Consider the following

Accuracy =sum accuracy

Total Number ofUsers (9)

Table 2 Comparison of results between different clustering meth-ods in second evaluation method

Accuracy rateAF 8095DB 8471HC 8203RA 6351

In the second evaluation we wait for another 6 hours tocollect more data There are 7100 votes in total 1420 votesare considered and 242 users participated in voting Basedon the second evaluation we obtained much better result asshown in Table 2 DBSCAN (density-based clustering) affin-ity propagation and hierarchical clustering obtained over80 accuracy DBSCAN reached to 8471

63 Comparison with Related Work In this section we com-pared our result with those of with several other movie re-commendation approachsystems Pomerantz and Dudek[16] used a hierarchical Bayesian approach combined withthe item similarity of the content for themovie recommenderand achieved the classifiers of 708 as the best result InBiancalana et alrsquos paper [11] the best result is 824 obtainedby combining three classifiers in a neural network Basu et al

8 Journal of Applied Mathematics

Table 3 Example of all negative votes

imdb Vote Last updatett1716777 minus1 2013-12-06 233655tt0183649 minus1 2013-12-06 233656tt1728196 minus1 2013-12-06 233658tt0014429 minus1 2013-12-06 233659tt0034248 minus1 2013-12-06 233701tt0017136 minus1 2013-12-06 233702tt0035169 minus1 2013-12-06 233703tt0023940 minus1 2013-12-06 233704tt0035209 minus1 2013-12-06 233705tt0047876 minus1 2013-12-06 233723tt0106535 minus1 2013-12-06 233724tt0121164 minus1 2013-12-06 233727tt0006864 minus1 2013-12-06 233728tt0499141 minus1 2013-12-06 233730tt0049010 minus1 2013-12-06 233732tt1298650 minus1 2013-12-06 234039tt0408236 minus1 2013-12-06 234040tt0162661 minus1 2013-12-06 234041tt0349903 minus1 2013-12-06 234043tt0114369 minus1 2013-12-06 234044tt0096895 minus1 2013-12-06 234045tt0111161 minus1 2013-12-06 234049tt2992146 minus1 2013-12-06 234053tt0082971 minus1 2013-12-06 234056tt1075419 minus1 2013-12-06 234058

[17] presented an inductive learning approach to recommen-dation and achieved an averaged precision of 83

Our best result (8471) achieved by DBSCAN [14] asshown in Table 2 improves 171 compared to best result ofrelated works mentioned above Meanwhile all these relatedworks are computed offline and they use the stored modelswhen testing and these models cannot be updated in timesince it takes too long to redo the learning process Our ap-proach on the other hand does not need to store usermodelsand it updates the recommendation result in real time bypreclustering movies and using a very simple learning ap-proach (majority voting) to provide movie recommendationSince movies are updated much less frequently than usersrsquovotes our approach is much more time efficient than relatedworks To sum up our research becomes significantly mean-ingful as it realizes the real timemodel updating while it doesnot compromise the accuracy

7. Conclusion and Future Work

In this paper, we proposed a new distance function for publish year (year related) to compute the distance matrix and then implemented a real-time movie recommendation system based on content (movie) using preclustering and majority voting. We implemented four different clustering techniques in the preclustering process and obtained 84.71% accuracy

Table 4: Example of nonzero votes.

imdb       Vote  Last update
tt0068767  -1    2013-12-06 22:33:30
tt0109958  -1    2013-12-06 22:33:32
tt0047034  -1    2013-12-06 22:33:33
tt0397892  -1    2013-12-06 22:33:35
tt0032551  -1    2013-12-06 22:33:36
tt0110008   1    2013-12-06 22:33:38
tt0416881  -1    2013-12-06 22:33:40
tt1233499  -1    2013-12-06 22:33:42
tt1754691  -1    2013-12-06 22:33:45
tt1034303  -1    2013-12-06 22:33:47
tt0015324  -1    2013-12-06 22:33:48
tt0049408  -1    2013-12-06 22:33:50
tt0033467  -1    2013-12-06 22:33:51
tt0000417  -1    2013-12-06 22:33:54
tt1437358  -1    2013-12-06 22:33:56
tt0063105  -1    2013-12-06 22:33:57
tt0253474  -1    2013-12-06 22:33:58
tt0034248   1    2013-12-06 22:34:00
tt0018578   1    2013-12-06 22:34:04
tt0335345  -1    2013-12-06 22:34:07
tt0028445   1    2013-12-06 22:34:10
tt0040536   1    2013-12-06 22:34:16
tt0332379  -1    2013-12-06 22:34:19
tt1728196  -1    2013-12-06 22:34:21

as the best result, which is better than some of the other research papers' approaches. We realized a real-time updating model with no compromise on accuracy. Our approach can work even better with special user groups such as people in colleges, universities, or professional communities.

The next step in this research is to make the system adaptive so that it can be widely used for many other types of content, such as articles, news, and music, and to obtain better accuracy. One of the limitations of our approach is that if a cluster contains a lot of positive votes as well as a lot of negative ones, we will not recommend movies in this cluster. However, this situation may happen for two totally different reasons. One is that the user does not care about the movies in this cluster. In this case, making no recommendation in this cluster is a wise choice. However, if the user really likes the movies in this cluster and has watched a lot of them, liking some and disliking others, ignoring this cluster will dissatisfy the user. To overcome this limitation, we can give weights to the votes from users. If a movie is voted positive by the user and is voted positive by many other people as well, the weight of such a vote will be decreased, while if the movie is voted positive by the user but is voted negative by most others, the vote reveals the special taste of this user and gains a higher weight, and vice versa. More specifically, we can set the weight of a positive vote as $N/(P + N)$ and the weight of a negative vote as $P/(P + N)$, where $P$ is the total number of positive votes from all users towards this movie and $N$ is the total number of negative votes from all users towards this movie. In this way, we can still ignore the cluster in the first situation while overcoming the limitation we had in the second situation.
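The proposed vote weighting can be sketched in a few lines of Python. This is only an illustration of the $N/(P + N)$ and $P/(P + N)$ weights described above; the function name `vote_weight` and the handling of a movie with no community votes are assumptions, not part of the deployed system.

```python
def vote_weight(user_vote, positives, negatives):
    """Weight of one user's vote on a movie.

    user_vote: +1 (like) or -1 (do not like).
    positives, negatives: P and N, the like / do-not-like counts
    from all users for the same movie.
    """
    total = positives + negatives
    if total == 0:
        return 0.0  # assumption: no community votes, so nothing to weight by
    if user_vote > 0:
        return negatives / total  # a like on a widely liked movie counts little
    if user_vote < 0:
        return positives / total  # a dislike on a widely disliked movie counts little
    return 0.0  # unseen votes carry no weight


# A like on a movie that 90 of 100 voters liked reveals little about the user,
# while a like on a movie that 90 of 100 voters disliked reveals a special taste.
common_like = vote_weight(+1, positives=90, negatives=10)
special_like = vote_weight(+1, positives=10, negatives=90)
```

How the weighted votes are then aggregated per cluster is left open here, since the text only defines the weights themselves.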

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors especially thank Dr. Khaled Rasheed for providing them with suggestions and information regarding their research. The authors also thank Mr. Hongfei Yan for the related work and discussions.

References

[1] M. A. Ghazanfar and A. Prugel-Bennett, "An improved switching hybrid recommender system using Naive Bayes classifier and collaborative filtering," in Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS '10), pp. 493–502, March 2010.

[2] A. Bellogín, I. Cantador, P. Castells, and A. Ortigosa, "Discovering relevant preferences in a personalised recommender system using machine learning techniques," in Proceedings of the ECML-PKDD Workshop on Preference Learning, 2008.

[3] F.-C. Lin, H.-W. Yu, C.-H. Hsu, and T.-C. Weng, "Recommendation system for localized products in vending machines," Expert Systems with Applications, vol. 38, no. 8, pp. 9129–9138, 2011.

[4] K.-J. Kim and H. Ahn, "A recommender system using GA K-means clustering in an online shopping market," Expert Systems with Applications, vol. 34, no. 2, pp. 1200–1209, 2008.

[5] S. B. Aher and L. M. R. J. Lobo, "Combination of machine learning algorithms for recommendation of courses in E-Learning System based on historical data," Knowledge-Based Systems, vol. 51, pp. 1–14, 2013.

[6] G. M. Dakhel and M. Mahdavi, "Providing an effective collaborative filtering algorithm based on distance measures and neighbors' voting," International Journal of Computer Information Systems and Industrial Management Applications, vol. 5, pp. 524–531, 2013.

[7] W. H. Ujwala, V. R. Sheetal, and M. Debajyoti, "A hybrid web recommendation system based on the improved association rule mining algorithm," Journal of Software Engineering and Applications, vol. 6, article 396, 2013.

[8] I. Tatli and A. Birturk, "A tag-based hybrid music recommendation system using semantic relations and multi-domain information," in Proceedings of the 11th IEEE International Conference on Data Mining Workshops (ICDMW '11), pp. 548–554, December 2011.

[9] Q. Yang, J. Fan, J. Wang, and L. Zhou, "Personalizing web page recommendation via collaborative filtering and topic-aware Markov model," in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM '10), pp. 1145–1150, December 2010.

[10] X. Cai, M. Bain, A. Krzywicki et al., "Collaborative filtering for people to people recommendation in social networks," in AI 2010: Advances in Artificial Intelligence, vol. 6464, pp. 476–485, Springer, Berlin, Germany, 2010.

[11] C. Biancalana, F. Gasparetti, A. Micarelli, A. Miola, and G. Sansonetti, "Context-aware movie recommendation based on signal processing and machine learning," in Proceedings of the 2nd Challenge on Context-Aware Movie Recommendation, pp. 5–10, October 2011.

[12] P. Jaccard, "The distribution of the flora in the alpine zone. 1," The New Phytologist, vol. 11, no. 2, pp. 37–50, 1912.

[13] B. J. Frey and D. Dueck, "Clustering by passing messages between data points," Science, vol. 315, no. 5814, pp. 972–976, 2007.

[14] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD '96), vol. 96, pp. 226–231, 1996.

[15] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Upper Saddle River, NJ, USA, 1988.

[16] D. Pomerantz and G. Dudek, "Context dependent movie recommendations using a hierarchical Bayesian model," in Advances in Artificial Intelligence, vol. 5549 of Lecture Notes in Computer Science, pp. 98–109, Springer, Berlin, Germany, 2009.

[17] C. Basu, H. Hirsh, and W. Cohen, "Recommendation as classification: using social and content-based information in recommendation," in Proceedings of the 15th National Conference on Artificial Intelligence (AAAI '98), pp. 714–720, July 1998.

[18] "2.3. Clustering - scikit-learn 0.14 documentation," http://scikit-learn.org/stable/modules/clustering.html#clustering.


Figure 3: Survey in our community.

Figure 4: Survey of top three liked movie factors (chart labels: publishing year, country, type, language, director, actor, duration; chart values as extracted: 9, 15, 42, 5, 9, 18, 2).

According to formula (3), we propose a new distance function for publish year, $\mathrm{PD}(\mathrm{year}_i, \mathrm{year}_j)$, shown in the following:

$$
\mathrm{PD}(\mathrm{year}_i, \mathrm{year}_j) = \frac{\int_{\min(\mathrm{year}_i, \mathrm{year}_j)}^{\max(\mathrm{year}_i, \mathrm{year}_j)} f(x)\,dx}{\int_{1902}^{2014} f(x)\,dx}. \tag{4}
$$

Figure 5: Statistical result of movies published in each year (horizontal axis: year, 1880 to 2020; vertical axis: count, 0 to 900; fitted trend line $y = 0.3584\,e^{0.0598(x - 1902)}$).
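As a rough illustration of formula (4), the publish-year distance can be evaluated in closed form when $f$ is an exponential. The sketch below assumes that $f$ is the trend line reported with Figure 5, $f(x) = 0.3584\,e^{0.0598(x-1902)}$; the function names are hypothetical.

```python
import math

# Assumed form of f(x), taken from the trend line reported with Figure 5.
A, B = 0.3584, 0.0598
YEAR_MIN, YEAR_MAX = 1902, 2014


def integral_f(lo, hi):
    """Closed-form integral of f(x) = A * exp(B * (x - YEAR_MIN)) over [lo, hi]."""
    return (A / B) * (math.exp(B * (hi - YEAR_MIN)) - math.exp(B * (lo - YEAR_MIN)))


def publish_year_distance(year_i, year_j):
    """PD(year_i, year_j) as in formula (4): share of the movie mass between the two years."""
    lo, hi = min(year_i, year_j), max(year_i, year_j)
    return integral_f(lo, hi) / integral_f(YEAR_MIN, YEAR_MAX)


recent_gap = publish_year_distance(2004, 2014)
early_gap = publish_year_distance(1902, 1912)
```

Because far more movies are published in recent years, the same ten-year gap yields a larger distance near 2014 than near 1902, which is the effect the integral-based definition is designed to capture.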

4.2.3. Duration Distance. Human beings are sensitive to small differences in time, which means that half an hour feels nearly twice as long as a quarter of an hour. Therefore, we use a linear function, shown in formula (5), to compute the duration distance $\mathrm{DUD}(\mathrm{dur}_i, \mathrm{dur}_j)$:

$$
\mathrm{DUD}(\mathrm{dur}_i, \mathrm{dur}_j) = \frac{\min\left(\left|\mathrm{dur}_i - \mathrm{dur}_j\right|, 200\right)}{200}. \tag{5}
$$

Here, in order to normalize the distance, we consider 200 minutes as the maximum duration of a movie.
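Formula (5) translates directly into code. The following minimal sketch assumes durations are given in minutes; the function and parameter names are illustrative.

```python
def duration_distance(dur_i, dur_j, max_duration=200):
    """DUD as in formula (5): absolute difference in minutes, capped and scaled to [0, 1]."""
    return min(abs(dur_i - dur_j), max_duration) / max_duration


half_hour_apart = duration_distance(90, 120)   # 30 / 200
far_apart = duration_distance(30, 400)         # capped at 200 / 200
```

The cap keeps unusually long titles from dominating the overall distance.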

Figure 6: Matrix computation parallel demo (areas of the matrix labeled (1) to (5)).

4.3. Overall Distance. We build a distance matrix $[n, n]$ between each pair of movies in our database based on formula (6) as follows:

$$
\mathrm{OD}(m_i, m_j) = \sum_{k \in \text{movie features}} \mathrm{weight}_k \cdot \mathrm{distance}_k(i, j). \tag{6}
$$
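The weighted sum in formula (6) can be sketched as follows. The two features, their per-feature distance functions, and the weights used here are placeholders chosen for illustration only; the actual feature set and weights come from the survey and the distance functions defined in Section 4.2.

```python
def duration_distance(dur_i, dur_j, cap=200):
    return min(abs(dur_i - dur_j), cap) / cap


def mismatch_distance(a, b):
    # Hypothetical 0/1 distance for a categorical feature.
    return 0.0 if a == b else 1.0


def overall_distance(movie_i, movie_j, weights, distances):
    """OD as in formula (6): weighted sum of per-feature distances."""
    return sum(w * distances[k](movie_i[k], movie_j[k]) for k, w in weights.items())


def distance_matrix(movies, weights, distances):
    """Full symmetric n x n matrix with a zero diagonal."""
    n = len(movies)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = overall_distance(movies[i], movies[j], weights, distances)
    return D


movies = [
    {"duration": 100, "country": "US"},
    {"duration": 150, "country": "CN"},
    {"duration": 100, "country": "US"},
]
weights = {"duration": 0.5, "country": 0.5}  # illustrative weights only
distances = {"duration": duration_distance, "country": mismatch_distance}
D = distance_matrix(movies, weights, distances)
```

Keeping the per-feature distances in a dictionary makes it easy to add or reweight features without touching the summation.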

4.4. Parallel Distance Computation. Since our distance matrix is symmetric, we can use this property to do the distance computation in parallel.

To do parallel computation, two conditions must be satisfied:

(1) The work assigned to different processes must be approximately equal.

(2) There is no interaction or data exchange between different working processes.

The distance matrix we need to build is a symmetric matrix in which all the elements on the main diagonal are equal to 0, formally defined as a matrix $D$, where $D = D^T$ and $D[i, i] = 0$, as follows:

$$
D = \begin{bmatrix}
0 & \lambda_{0,1} & \lambda_{0,2} & \cdots & \lambda_{0,n-1} & \lambda_{0,n} \\
\lambda_{0,1} & 0 & \lambda_{1,2} & \cdots & \lambda_{1,n-1} & \lambda_{1,n} \\
\lambda_{0,2} & \lambda_{1,2} & 0 & \cdots & \lambda_{2,n-1} & \lambda_{2,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\lambda_{0,n-1} & \lambda_{1,n-1} & \lambda_{2,n-1} & \cdots & 0 & \lambda_{n-1,n} \\
\lambda_{0,n} & \lambda_{1,n} & \lambda_{2,n} & \cdots & \lambda_{n-1,n} & 0
\end{bmatrix}. \tag{7}
$$

The sketch of the distance matrix in formula (7) is shown in Figure 6. Since the distance matrix is symmetric, the data in area (1) and area (2) and those in area (4) and area (5) are correspondingly equal. Therefore, instead of computing all of them, we just compute the upper triangle of the matrix, that is, the data in areas (1) and (5). However, if we separate the parallel tasks by rows of the matrix, the amount of work (number of elements) in each task is not equal (it ranges from $n - 1$ to 1). To overcome this problem, the work needed in area (4) is filled into area (3) so that we can group the $n - 1$ unequal tasks into $\lceil n/2 \rceil$ equal tasks. In this way, a process pool in which each process computes the distances of one task can be used to parallelize these $\lceil n/2 \rceil$ tasks.

More specifically, we first set the main diagonal to 0. There is no need to compute the main diagonal since we know that the distance between a movie and itself is 0. Then, for each row $i$, we combine the distance computations of $D[n-1-i, n-i], D[n-1-i, n-i+1], \ldots, D[n-1-i, n-2], D[n-1-i, n-1]$ and those of $D[i, i+1], D[i, i+2], \ldots, D[i, n-2], D[i, n-1]$ together as one task. After the computation, we get an upper triangular matrix, that is, area (1) and area (4). Then we fill the lower triangle with the upper triangle, that is, let $D = D + D^T$.

Figure 7: Affinity propagation demo [18].
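The row-pairing scheme described above can be sketched as follows. The helper names are hypothetical, and the sketch uses a thread pool from `concurrent.futures` for portability; for CPU-bound distance functions a process pool (for example `ProcessPoolExecutor` with a picklable task function) would be the closer match to the process pool mentioned in the text.

```python
from concurrent.futures import ThreadPoolExecutor


def make_tasks(n):
    """Pair row i with row n-1-i so that each task covers about n-1 upper-triangle cells."""
    tasks = []
    for i in range((n + 1) // 2):
        rows = sorted({i, n - 1 - i})  # a set, so the middle row is not repeated when n is odd
        tasks.append([(r, c) for r in rows for c in range(r + 1, n)])
    return tasks


def compute_task(cells, dist):
    """Compute the distances of one task; tasks share no data with each other."""
    return [(r, c, dist(r, c)) for r, c in cells]


def parallel_distance_matrix(n, dist, workers=4):
    D = [[0.0] * n for _ in range(n)]  # the main diagonal stays 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for result in pool.map(lambda cells: compute_task(cells, dist), make_tasks(n)):
            for r, c, d in result:
                D[r][c] = d
                D[c][r] = d  # mirror into the lower triangle, same effect as D + D^T
    return D


tasks_even = make_tasks(6)
D = parallel_distance_matrix(5, lambda i, j: float(abs(i - j)))
```

For even $n$ every task holds exactly $n - 1$ cells; for odd $n$ the middle row forms a smaller task on its own, which still satisfies the "approximately equal" condition.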

5. Preclustering System

In this section, we use four clustering methods to investigate which clustering method is best suited for the final movie recommendation system. The four clustering algorithms are affinity propagation [13], DBSCAN [14], hierarchical clustering [15], and random clustering, which serves as a baseline.

5.1. Clustering Method

5.1.1. Affinity Propagation. Affinity propagation, which was first proposed in [13], generates clusters by sending messages between pairs of samples until convergence or until changes fall below a threshold. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability for one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen and hence the final clustering is given, as demonstrated in Figure 7.

5.1.2. DBSCAN. The DBSCAN algorithm [14] views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be of any shape, as opposed to k-means, which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure), and a set of non-core samples that are close to a core sample (but are not themselves core samples), as demonstrated in Figure 8.

Figure 8: DBSCAN demo [18].
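To make the notions of core and non-core samples concrete, here is a minimal pure-Python sketch of DBSCAN running directly on a precomputed distance matrix. It is an illustration of the idea only, not the Scikit-learn implementation used in our experiments; the parameter names `eps` and `min_samples` and the toy data are assumptions.

```python
def dbscan_precomputed(D, eps, min_samples):
    """Return one label per item: a cluster id (0, 1, ...) or -1 for noise.

    D is a full symmetric distance matrix (list of lists). An item is a core
    sample if at least min_samples items (itself included) lie within eps.
    """
    n = len(D)
    neighbors = [[j for j in range(n) if D[i][j] <= eps] for i in range(n)]
    is_core = [len(nb) >= min_samples for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster from this unassigned core sample.
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            if not is_core[p]:
                continue  # non-core samples join a cluster but do not expand it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    frontier.append(q)
        cluster += 1
    return labels


# Toy one-dimensional data: two dense groups and one isolated point.
points = [0, 1, 2, 10, 11, 12, 50]
D = [[abs(a - b) for b in points] for a in points]
labels = dbscan_precomputed(D, eps=1.5, min_samples=2)
```

Because the algorithm only needs pairwise distances, the overall distance matrix from Section 4.3 can be passed in unchanged.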

5.1.3. Hierarchical Clustering. Hierarchical clustering [15] is a general family of clustering algorithms that build nested clusters by merging them successively. This hierarchy of clusters is represented as a tree. The root of the tree is the unique cluster that gathers all the samples, and the leaves are the clusters with only one sample.

The beauty of this clustering method is that we do not need to provide the number of clusters ahead of time. Instead, we can initialize with a large cut height and decrease the cut height when the user has voted on most movies in one cluster. In this way, we can provide a larger cluster to fit the user's interests, as shown in Figure 9.
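The effect of the cut height can be illustrated with a small sketch. It assumes single linkage, which the text does not specify: under single linkage, cutting the tree at a given height is equivalent to joining all items connected by a chain of distances no larger than that height, so a union-find structure is enough. The function name and toy data are hypothetical.

```python
def cut_single_linkage(D, height):
    """Cluster labels obtained by cutting a single-linkage tree at `height`.

    D is a full symmetric distance matrix (list of lists).
    """
    n = len(D)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):
            if D[i][j] <= height:
                parent[find(i)] = find(j)

    # Relabel the union-find roots as consecutive cluster ids 0, 1, ...
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


points = [0, 1, 2, 10, 11, 30]
D = [[abs(a - b) for b in points] for a in points]
low_cut = cut_single_linkage(D, height=1.5)
mid_cut = cut_single_linkage(D, height=9)
high_cut = cut_single_linkage(D, height=100)
```

In this sketch no cluster count is given in advance: only the cut height changes, and raising it yields fewer, larger clusters.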

5.1.4. Random Clustering. We first fix the number of clusters to 60, and then, for each movie, we randomly pick the cluster it belongs to by generating a random number between 1 and 60.
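The random baseline is straightforward; a minimal sketch is given below. The function name and the optional `seed` parameter (added here only for reproducibility) are assumptions.

```python
import random


def random_clustering(movie_ids, n_clusters=60, seed=None):
    """Assign each movie a cluster label drawn uniformly from 1..n_clusters."""
    rng = random.Random(seed)
    return {movie_id: rng.randint(1, n_clusters) for movie_id in movie_ids}


baseline = random_clustering(["tt0111161", "tt0114369", "tt0068767"], seed=0)
```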

5.2. Preclustering System Output. For efficiency, we output the clustering label results in two formats. One is used to look up the cluster label by movie ID, and the other is used to look up all the movie IDs in one cluster given the cluster label. By outputting the files in JSON, we can easily transfer data between the preclustering system in Python and the real-time online recommendation system in PHP.
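The two lookup formats can be produced from a single label assignment, as in the following sketch. The function name and the example IDs and labels are illustrative; the actual file layout of our system is not shown here.

```python
import json
from collections import defaultdict


def build_outputs(labels):
    """labels: dict mapping movie ID -> cluster label.

    Returns two JSON strings: movie ID -> label, and label -> list of movie IDs.
    """
    movie_to_cluster = {movie_id: int(label) for movie_id, label in labels.items()}
    cluster_to_movies = defaultdict(list)
    for movie_id, label in movie_to_cluster.items():
        cluster_to_movies[str(label)].append(movie_id)  # JSON object keys are strings
    return (
        json.dumps(movie_to_cluster, sort_keys=True),
        json.dumps(cluster_to_movies, sort_keys=True),
    )


by_movie_json, by_cluster_json = build_outputs(
    {"tt0111161": 3, "tt0114369": 3, "tt0068767": 7}
)
```

Precomputing both directions means the online system never has to scan the full label list at request time.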

6. Evaluation

Before the evaluation started, we created a survey and asked users to select the top three movie features they care about, as mentioned in Section 4.1. After one week of data collection, we obtained the result shown in Figure 4.

Table 1: Comparison of results between different clustering methods in the first evaluation method.

Method  True positive  False positive  Accuracy rate (%)
AF      239            98              70.920
DB      174            77              69.323
HC      228            88              72.152
RA      125            104             54.585

Based on the result, we calculate the distance matrix and use it as the input to the different clustering methods. We implemented the four clustering techniques using the Python library Scikit-learn.

6.1. Experimental Setup. The evaluation was conducted on a university social website in China, which is owned by one of our authors. We added the recommendation function in a banner and put it on top of the movie service to let users vote. An overview of the voting component of the website is shown in Figure 10.

There are three voting choices for each recommended movie: like, do not like, and unseen. Voting like scores 1, voting do not like scores -1, and voting unseen scores 0. Because of the limited time for data collection, we cannot wait for users to watch the recommended movies. Therefore, we add the unseen choice for users to select if they have not seen the recommended movie, and we only consider the votes that are like or do not like to evaluate our approach.

6.2. Result. After two weeks of voting on the movie recommendations, we have collected 168,424 votes in total. 132,360 votes were collected in 12 days and used for the recommendation majority voting, as mentioned in Section 3.3. The remaining 6,064 votes were collected in the two days after the recommendation system was deployed. Since only the like votes and do not like votes are considered eligible, we generated the results in the following subsections.

6.2.1. First Evaluation Result. In the first evaluation result, listed in Table 1, we calculated the true positives, false positives, and accuracy for each clustering technique. True positive equals the number of like votes, and false positive equals the number of do not like votes. Because most of the recommended movies were voted unseen, a total of 1,133 votes were counted, and 216 users participated in the voting. Consider the following:

$$
\text{Accuracy} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}. \tag{8}
$$
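As a quick check of formula (8), the accuracy rates in Table 1 can be recomputed from the true positive and false positive counts. The dictionary layout below is just a convenient way to hold the Table 1 counts.

```python
def accuracy(true_positive, false_positive):
    """Formula (8): share of like votes among the like and do-not-like votes."""
    return true_positive / (true_positive + false_positive)


# (true positive, false positive) counts from Table 1.
table1 = {"AF": (239, 98), "DB": (174, 77), "HC": (228, 88), "RA": (125, 104)}
rates = {name: round(100 * accuracy(tp, fp), 3) for name, (tp, fp) in table1.items()}
total_votes = sum(tp + fp for tp, fp in table1.values())
```

The recomputed rates match the table, and the counts add up to the 1,133 votes considered in the first evaluation.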

As shown by the result in Table 1, the hierarchical clustering approach is the best. However, the accuracy of every clustering approach is unsatisfactory. Therefore, we extracted the users' voting history from our server database and analyzed those data. We found that some of the users made fewer than 5 votes, and some of them made between 5 and 10 votes.

Figure 9: Hierarchical clustering demo (dendrogram; vertical axis: height).

Figure 10: Website overview (screenshot of the movie banner, where each recommended movie is shown with its rating and like, dislike, unseen, and details controls).

There are a few users who made more than 20 votes. Among those who made more than 20 votes, we found that 4 users voted do not like on every movie. Therefore, almost one third of the false positive votes were made by only 4 of the 216 users. Table 3 shows an example from one typical such user. The first column is the movie ID from IMDb, the second column is the voting result, and the third column is the voting time. We think this user may dislike our recommendation idea and therefore made all of his or her votes negative.

Another small portion of users made more than 20 votes, most of which are negative; only a few are positive, and there are no zero (unseen) votes. Table 4 shows one such example. This is probably because the user voted negative instead of choosing unseen for the unseen movies, since most users voted unseen more often than like and do not like.

6.2.2. Second Evaluation Result. The two situations mentioned in Section 6.2.1 affect our result considerably in the first evaluation. Therefore, we made a second evaluation by calculating each user's accuracy with formula (8) and then computing the average of all users' accuracies. Consider the following:

$$
\text{Accuracy} = \frac{\sum \text{accuracy}}{\text{Total Number of Users}}. \tag{9}
$$
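The difference between the pooled accuracy of formula (8) and the per-user average of formula (9) can be seen on a toy example. The vote lists below are invented for illustration, and skipping users who have only unseen votes is an assumption of this sketch.

```python
def pooled_accuracy(votes_by_user):
    """Formula (8) applied to all like / do-not-like votes pooled together."""
    likes = sum(votes.count(1) for votes in votes_by_user.values())
    dislikes = sum(votes.count(-1) for votes in votes_by_user.values())
    return likes / (likes + dislikes)


def per_user_average_accuracy(votes_by_user):
    """Formula (9): average of each user's own accuracy from formula (8)."""
    per_user = []
    for votes in votes_by_user.values():
        likes, dislikes = votes.count(1), votes.count(-1)
        if likes + dislikes:  # assumption: users with only unseen votes are skipped
            per_user.append(likes / (likes + dislikes))
    return sum(per_user) / len(per_user)


# Hypothetical votes: 1 = like, -1 = do not like, 0 = unseen.
votes_by_user = {
    "user_a": [1, 1, 0],
    "user_b": [1, -1],
    "user_c": [-1] * 20,  # one user who votes everything negative
}
pooled = pooled_accuracy(votes_by_user)
averaged = per_user_average_accuracy(votes_by_user)
```

In this example a single all-negative voter pulls the pooled accuracy down to 0.125, while the per-user average is 0.5, which shows why the second evaluation is less sensitive to a handful of heavy voters.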

Table 2: Comparison of results between different clustering methods in the second evaluation method.

Method  Accuracy rate (%)
AF      80.95
DB      84.71
HC      82.03
RA      63.51

In the second evaluation, we waited another 6 hours to collect more data. There are 7,100 votes in total, of which 1,420 are considered, and 242 users participated in the voting. Based on the second evaluation, we obtained a much better result, as shown in Table 2. DBSCAN (density-based clustering), affinity propagation, and hierarchical clustering obtained over 80% accuracy, and DBSCAN reached 84.71%.

63 Comparison with Related Work In this section we com-pared our result with those of with several other movie re-commendation approachsystems Pomerantz and Dudek[16] used a hierarchical Bayesian approach combined withthe item similarity of the content for themovie recommenderand achieved the classifiers of 708 as the best result InBiancalana et alrsquos paper [11] the best result is 824 obtainedby combining three classifiers in a neural network Basu et al

8 Journal of Applied Mathematics

Table 3 Example of all negative votes

imdb Vote Last updatett1716777 minus1 2013-12-06 233655tt0183649 minus1 2013-12-06 233656tt1728196 minus1 2013-12-06 233658tt0014429 minus1 2013-12-06 233659tt0034248 minus1 2013-12-06 233701tt0017136 minus1 2013-12-06 233702tt0035169 minus1 2013-12-06 233703tt0023940 minus1 2013-12-06 233704tt0035209 minus1 2013-12-06 233705tt0047876 minus1 2013-12-06 233723tt0106535 minus1 2013-12-06 233724tt0121164 minus1 2013-12-06 233727tt0006864 minus1 2013-12-06 233728tt0499141 minus1 2013-12-06 233730tt0049010 minus1 2013-12-06 233732tt1298650 minus1 2013-12-06 234039tt0408236 minus1 2013-12-06 234040tt0162661 minus1 2013-12-06 234041tt0349903 minus1 2013-12-06 234043tt0114369 minus1 2013-12-06 234044tt0096895 minus1 2013-12-06 234045tt0111161 minus1 2013-12-06 234049tt2992146 minus1 2013-12-06 234053tt0082971 minus1 2013-12-06 234056tt1075419 minus1 2013-12-06 234058

[17] presented an inductive learning approach to recommen-dation and achieved an averaged precision of 83

Our best result (8471) achieved by DBSCAN [14] asshown in Table 2 improves 171 compared to best result ofrelated works mentioned above Meanwhile all these relatedworks are computed offline and they use the stored modelswhen testing and these models cannot be updated in timesince it takes too long to redo the learning process Our ap-proach on the other hand does not need to store usermodelsand it updates the recommendation result in real time bypreclustering movies and using a very simple learning ap-proach (majority voting) to provide movie recommendationSince movies are updated much less frequently than usersrsquovotes our approach is much more time efficient than relatedworks To sum up our research becomes significantly mean-ingful as it realizes the real timemodel updating while it doesnot compromise the accuracy

7 Conclusion and Future Work

In this paper we proposed a newdistance function for publishyear (year related) to compute the distance matrix and thenimplemented a real time movie recommendation systembased on content (movie) using preclustering and majorityvoting We implemented four different clustering techniquesin the preclustering process and obtained 8471 accuracy

Table 4 Example of nonzero votes

imdb Vote Last updatett0068767 minus1 2013-12-06 223330tt0109958 minus1 2013-12-06 223332tt0047034 minus1 2013-12-06 223333tt0397892 minus1 2013-12-06 223335tt0032551 minus1 2013-12-06 223336tt0110008 1 2013-12-06 223338tt0416881 minus1 2013-12-06 223340tt1233499 minus1 2013-12-06 223342tt1754691 minus1 2013-12-06 223345tt1034303 minus1 2013-12-06 223347tt0015324 minus1 2013-12-06 223348tt0049408 minus1 2013-12-06 223350tt0033467 minus1 2013-12-06 223351tt0000417 minus1 2013-12-06 223354tt1437358 minus1 2013-12-06 223356tt0063105 minus1 2013-12-06 223357tt0253474 minus1 2013-12-06 223358tt0034248 1 2013-12-06 223400tt0018578 1 2013-12-06 223404tt0335345 minus1 2013-12-06 223407tt0028445 1 2013-12-06 223410tt0040536 1 2013-12-06 223416tt0332379 minus1 2013-12-06 223419tt1728196 minus1 2013-12-06 223421

as the best result which is better than some of the otherresearch papersrsquo approachesWe realized a real time updatingmodel with no compromises on accuracy Our approach caneven work better with special user groups such as people incolleges universities or professional communities

The next step in this research is to make the systemadaptive to bewidely used bymany other types of groups suchas articles news and music and to obtain better accuracyOne of the limitations in our approach is that if there isa cluster that contains a lot of positive votes as well asa lot of negative ones we will not recommend movies inthis cluster However this situation may happen becauseof two totally different reasons One is that the user doesnot care about the movies in this cluster In this case norecommendation in this cluster is a wise choice Howeverif the user really likes the movies in this cluster and heshewatched a lot of movies in this cluster and some moviesheshe likes while some heshe dislikes ignoring this clusterin such situation will dissatisfy the user To conquer thislimitation we can give weights to the votes from users Inthe movie which is voted positive by the user and is votedpositive by many other people as well the weight of suchvote will be decreased while if the movie is voted positiveby the user but is voted negative by most of others thismeans such vote reveals the special taste from this user andgains higher weight and vice versa In a more specific waywe can set the weight of positive as 119873(119875 + 119873) and theweight of negative as 119875(119875 + 119873) where 119875 is the total positive

Journal of Applied Mathematics 9

votes from all users towards this movie and 119873 is the totalnegative votes from all users towards this movie In this waywe can still ignore the cluster in first situation while over-coming the limitation we had in the second situation

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors especially thank Dr Khaled Rasheed for provid-ing them with suggestions and information regarding theirresearch The authors also thank Mr Hongfei Yan for the re-lated work and discussions

References

[1] M A Ghazanfar and A Prugel-Bennett ldquoAn improved switch-ing hybrid recommender system using Naive Bayes classifierand collaborative filteringrdquo in Proceedings of the InternationalMultiConference of Engineers and Computer Scientists (IMECSrsquo10) pp 493ndash502 March 2010

[2] A Bellogın I Cantador P Castells and A Ortigosa ldquoDis-covering relevant preferences in a personalised recommendersystemusingmachine learning techniquesrdquo inProceedings of theECML-PKDDWorkshop on Preference Learning 2008

[3] F-C LinH-WYu C-HHsu andT-CWeng ldquoRecommenda-tion system for localized products in vendingmachinesrdquo ExpertSystems with Applications vol 38 no 8 pp 9129ndash9138 2011

[4] K-J Kim and H Ahn ldquoA recommender system using GA K-means clustering in an online shoppingmarketrdquo Expert Systemswith Applications vol 34 no 2 pp 1200ndash1209 2008

[5] S B Aher and LM R J Lobo ldquoCombination ofmachine learn-ing algorithms for recommendation of courses in E-LearningSystem based on historical datardquoKnowledge-Based Systems vol51 pp 1ndash14 2013

[6] G M Dakhel and M Mahdavi ldquoProviding an effective col-laborative filtering algorithm based on distance measures andneighborsrsquo votingrdquo International Journal of Computer Informa-tion Systems and Industrial Management Applications vol 5 pp524ndash531 2013

[7] W H Ujwala V R Sheetal and M Debajyoti ldquoA hybrid webrecommendation system based on the improved associationrule mining algorithmrdquo Journal of Software Engineering andApplications vol 6 article 396 2013

[8] I Tatli and A Birturk ldquoA tag-based hybrid music recom-mendation system using semantic relations and multi-domaininformationrdquo in Proceedings of the 11th IEEE InternationalConference on Data Mining Workshops (ICDMW rsquo11) pp 548ndash554 December 2011

[9] Q Yang J Fan J Wang and L Zhou ldquoPersonalizing web pagerecommendation via collaborative filtering and topic-awareMarkov modelrdquo in Proceedings of the 10th IEEE InternationalConference on Data Mining (ICDM rsquo10) pp 1145ndash1150 Decem-ber 2010

[10] X Cai M Bain A Krzywicki et al ldquoCollaborative filtering forpeople to people recommendation in social networksrdquo in AI2010 Advances in Artificial Intelligence vol 6464 pp 476ndash485Springer Berlin Germany 2010

[11] C Biancalana F Gasparetti A Micarelli A Miola and GSansonetti ldquoContext-aware movie recommendation based onsignal processing and machine learningrdquo in Proceedings of the2nd Challenge on Context-Aware Movie Recommendation pp5ndash10 October 2011

[12] P Jaccard ldquoThe distribution of the flora in the alpine zone 1rdquoThe New Phytologist vol 11 no 2 pp 37ndash50 1912

[13] B J Frey and D Dueck ldquoClustering by passing messages be-tween data pointsrdquo Science vol 315 no 5814 pp 972ndash976 2007

[14] M Ester H-P Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databases withnoiserdquo in Proceedings of the 2nd International Conference onKnowledge Discovery and Data Mining (KDD rsquo96) vol 96 pp226ndash231 1996

[15] AK Jain andRCDubesAlgorithms for ClusteringData Pren-tice-Hall Upper Saddle River NJ USA 1988

[16] D Pomerantz and G Dudek ldquoContext dependent movie re-commendations using a hierarchical Bayesian modelrdquo in Ad-vances in artificial intelligence vol 5549 of Lecture Notes inComputer Science pp 98ndash109 Springer Berlin Germany 2009

[17] C Basu H Hirsh and W Cohen ldquoRecommendation as classi-fication using social and content-based information in recom-mendationrdquo in Proceedings of the 15th National Conference onArtificial Intelligence (AAAI rsquo98) pp 714ndash720 July 1998

[18] ldquo23 clusteringmdashscikit-learn 014 documentationrdquo httpscikit-learnorgstablemodulesclusteringhtmlclustering



Figure 6: Matrix computation parallel demo (the matrix is divided into areas (1) to (5)).

4.3. Overall Distance. We build up a distance matrix [n, n] between each pair of movies in our database based on formula (6) as follows:

\[
\mathrm{OD}(m_i, m_j) = \sum_{k \,\in\, \text{movie features}} \mathrm{weight}_k \cdot \mathrm{distance}_k(i, j) \tag{6}
\]
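As a rough illustration of formula (6), the weighted sum can be sketched in a few lines of Python. The feature names, the weights, and the `year_distance` helper below are hypothetical placeholders and are not the functions used in the paper; Jaccard distance is assumed for set-valued features only because the paper cites Jaccard [12].

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def year_distance(a, b, span=100):
    """Hypothetical placeholder: absolute year gap, capped and scaled to [0, 1]."""
    return min(abs(a - b) / span, 1.0)


# Hypothetical features and weights (made-up values for illustration only).
FEATURES = {
    "genres": (0.7, jaccard_distance),
    "year": (0.3, year_distance),
}


def overall_distance(movie_i, movie_j, features=FEATURES):
    """Formula (6): sum over features k of weight_k * distance_k(i, j)."""
    return sum(
        weight * dist(movie_i[name], movie_j[name])
        for name, (weight, dist) in features.items()
    )


a = {"genres": {"drama", "crime"}, "year": 1994}
b = {"genres": {"drama"}, "year": 2004}
print(overall_distance(a, b))
```

Because every per-feature distance here is symmetric and zero for identical inputs, the resulting overall distance is symmetric with a zero diagonal, which is the property the parallel computation relies on.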

4.4. Parallel Distance Computation. Since our distance matrix is symmetric, we can use this property to do the distance computation in parallel.

To do parallel computation, two conditions must be satisfied:

(1) The work assigned to different processes must be approximately equal.

(2) There is no interaction or data exchange between different working processes.

The distance matrix we need to build is a symmetric matrix in which all the elements on the main diagonal are equal to 0, formally defined as a matrix $D$ where $D = D^T$ and $D[i, i] = 0$, as follows:

\[
D =
\begin{bmatrix}
0 & \lambda_{0,1} & \lambda_{0,2} & \cdots & \lambda_{0,n-1} & \lambda_{0,n} \\
\lambda_{0,1} & 0 & \lambda_{1,2} & \cdots & \lambda_{1,n-1} & \lambda_{1,n} \\
\lambda_{0,2} & \lambda_{1,2} & 0 & \cdots & \lambda_{2,n-1} & \lambda_{2,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\lambda_{0,n-1} & \lambda_{1,n-1} & \lambda_{2,n-1} & \cdots & 0 & \lambda_{n-1,n} \\
\lambda_{0,n} & \lambda_{1,n} & \lambda_{2,n} & \cdots & \lambda_{n-1,n} & 0
\end{bmatrix}
\tag{7}
\]

A sketch of the distance matrix in formula (7) is shown in Figure 6. Since the distance matrix is symmetric, the data in area (1) and area (2), and that in area (4) and area (5), are correspondingly equal. Therefore, instead of computing all of them, we just compute the upper triangle of the matrix, that is, the data in areas (1) and (5). However, if we separate the parallel tasks by rows of the matrix, the work (number of elements) in each task is not equal (from $n - 1$ down to 1). To overcome this problem, the work needed in area (4) is filled into area (3), so that we can group the $n - 1$ unequal tasks into $\lceil n/2 \rceil$ equal tasks. In this way, a process pool, where each process computes the distances in one task, can be used to parallelize these $\lceil n/2 \rceil$ tasks.

Figure 7: Affinity propagation demo [18].

More specifically, we first set the main diagonal to 0. There is no need to compute the main diagonal, since we know that the distance between a movie and itself is 0. Then, for each row $i$, we combine the distance computations of $D[n-1-i, n-i], D[n-1-i, n-i+1], \ldots, D[n-1-i, n-2], D[n-1-i, n-1]$ and those of $D[i, i+1], D[i, i+2], \ldots, D[i, n-2], D[i, n-1]$ together as one task. After computation, we get an upper triangular matrix, that is, area (1) and area (4). We then fill the lower triangle with the upper triangle, that is, let $D = D + D^T$.
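The row pairing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the names `paired_row_tasks` and `build_distance_matrix` are hypothetical, rows are indexed from 0 to n − 1, and the `mapper` argument stands in for whatever worker pool is used.

```python
import math


def paired_row_tasks(n):
    """Split the strict upper triangle of an n x n matrix into ceil(n/2) tasks.

    Task i holds the cells of row i plus the cells of its mirror row n-1-i,
    so every task has n-1 cells (except the middle task when n is odd).
    """
    tasks = []
    for i in range(math.ceil(n / 2)):
        mirror = n - 1 - i
        cells = [(i, j) for j in range(i + 1, n)]
        if mirror != i:  # for odd n the middle row is paired with itself
            cells += [(mirror, j) for j in range(mirror + 1, n)]
        tasks.append(cells)
    return tasks


def build_distance_matrix(n, dist, mapper=map):
    """Compute the upper triangle task by task, then mirror it (D = D + D^T)."""
    D = [[0.0] * n for _ in range(n)]  # main diagonal stays 0

    def run(cells):
        # One task: no shared state, results are returned to the caller.
        return [(i, j, dist(i, j)) for i, j in cells]

    for result in mapper(run, paired_row_tasks(n)):
        for i, j, d in result:
            D[i][j] = d
            D[j][i] = d
    return D


print([len(t) for t in paired_row_tasks(6)])
```

Any map-like callable can be passed as `mapper`, for example the `map` method of a `concurrent.futures.ThreadPoolExecutor`. A process pool, as described in the text, would additionally need the task function to be defined at module level so that it can be pickled.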

5. Preclustering System

In this section, we use four clustering methods to investigate which clustering method is best suited for the final movie recommendation system. The four clustering algorithms are affinity propagation [13], DBSCAN [14], hierarchical clustering [15], and random clustering as a baseline.

5.1. Clustering Method

5.1.1. Affinity Propagation. Affinity propagation, which was first proposed in [13], generates clusters by sending messages between pairs of samples until convergence or until changes fall below a threshold. A dataset is then described using a small number of exemplars, which are identified as those most representative of other samples. The messages sent between pairs represent the suitability of one sample to be the exemplar of the other, which is updated in response to the values from other pairs. This updating happens iteratively until convergence, at which point the final exemplars are chosen and hence the final clustering is given, as demonstrated in Figure 7.

5.1.2. DBSCAN. The DBSCAN algorithm [14] views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be of any shape, as opposed to k-means, which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples that are in areas of high density. A cluster is therefore a set of core samples, each close to each other (measured by some distance measure), and a set of non-core samples that are close to a core sample (but are not themselves core samples), as demonstrated in Figure 8.

Figure 8: DBSCAN demo [18].
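To make the notions of core and non-core samples concrete, here is a minimal pure-Python DBSCAN that works directly on a precomputed distance matrix. It is a simplified sketch for illustration and is not the scikit-learn implementation used in the experiments; the parameter names `eps` and `min_samples` and the convention that a point counts as its own neighbor are assumptions chosen to mirror common usage.

```python
def dbscan_precomputed(D, eps, min_samples):
    """Label points given a full distance matrix D; -1 marks noise."""
    n = len(D)
    labels = [None] * n  # None = not yet visited

    def neighbors(p):
        # Includes p itself, since D[p][p] == 0 <= eps.
        return [q for q in range(n) if D[p][q] <= eps]

    cluster = -1
    for p in range(n):
        if labels[p] is not None:
            continue
        nb = neighbors(p)
        if len(nb) < min_samples:
            labels[p] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1  # p is a core sample: start a new cluster
        labels[p] = cluster
        queue = [q for q in nb if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster  # non-core sample close to a core sample
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_nb = neighbors(q)
            if len(q_nb) >= min_samples:  # q is also core: keep expanding
                queue.extend(q_nb)
    return labels


points = [0, 1, 2, 10, 11, 12, 50]
D = [[abs(a - b) for b in points] for a in points]
print(dbscan_precomputed(D, eps=1.5, min_samples=2))
```

In this toy example the two dense groups form separate clusters and the isolated point is labelled as noise, without the number of clusters being specified in advance.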

5.1.3. Hierarchical Clustering. Hierarchical clustering [15] is a general family of clustering algorithms that build nested clusters by merging them successively. This hierarchy of clusters is represented as a tree. The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample.

The beauty of this clustering method is that we do not need to provide the number of clusters in advance. Instead, we can initialize with a large cut height and decrease the cut height when the user has voted on most movies in one cluster. In this way, we can provide a larger cluster to fit the user’s interests, as shown in Figure 9.
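The effect of the cut height can be illustrated with a small sketch. It assumes single linkage (the paper does not state which linkage is used) and the usual dendrogram convention in which a higher cut yields fewer, larger clusters; under single linkage, cutting at a given height is equivalent to joining every pair of samples whose distance does not exceed that height. The function name `single_linkage_cut` is hypothetical.

```python
def single_linkage_cut(D, height):
    """Cluster labels obtained by cutting a single-linkage dendrogram at `height`."""
    n = len(D)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Join every pair that is at most `height` apart.
    for i in range(n):
        for j in range(i + 1, n):
            if D[i][j] <= height:
                parent[find(i)] = find(j)

    # Relabel components 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


points = [0, 1, 2, 10, 11, 12, 50]
D = [[abs(a - b) for b in points] for a in points]
for h in (0.5, 1.5, 9):
    print(h, single_linkage_cut(D, h))
```

The same distance matrix therefore supports several cluster granularities, and only the cut height needs to change.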

5.1.4. Random Clustering. We first fix the number of clusters to 60, and then, for each movie, we randomly pick the cluster it belongs to by generating a random number between 1 and 60.
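The random baseline amounts to only a few lines. The sketch below assumes movies are identified by string IDs; the function name and the `seed` parameter are illustrative additions.

```python
import random


def random_clustering(movie_ids, n_clusters=60, seed=None):
    """Assign each movie a cluster label drawn uniformly from 1..n_clusters."""
    rng = random.Random(seed)
    return {movie_id: rng.randint(1, n_clusters) for movie_id in movie_ids}


labels = random_clustering(["tt0111161", "tt0114369", "tt0096895"], seed=0)
print(labels)
```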

5.2. Preclustering System Output. For efficiency, we output the clustering labels in two formats. One is used to look up the cluster label by movie ID, and the other is used to look up all the movie IDs in one cluster given the cluster label. By outputting the files in JSON, we can easily transfer data between the preclustering system in Python and the real-time online recommendation system in PHP.
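A possible shape for the two outputs is sketched below. The exact file layout is not given in the paper, so the structure, the function name `export_labels`, and the cluster assignments in the example are assumptions.

```python
import json
from collections import defaultdict


def export_labels(labels):
    """Return two JSON strings: movie ID -> cluster, and cluster -> movie IDs."""
    by_cluster = defaultdict(list)
    for movie_id, cluster in labels.items():
        # JSON object keys are strings, so the cluster label is stored as text.
        by_cluster[str(cluster)].append(movie_id)
    movie_to_cluster = json.dumps(labels, sort_keys=True)
    cluster_to_movies = json.dumps(by_cluster, sort_keys=True)
    return movie_to_cluster, cluster_to_movies


# Made-up cluster labels for three movie IDs.
example = {"tt0111161": 3, "tt0114369": 3, "tt0096895": 7}
movie_to_cluster, cluster_to_movies = export_labels(example)
print(movie_to_cluster)
print(cluster_to_movies)
```

Keeping both directions precomputed means the online system can answer either lookup with a single key access instead of scanning all labels.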

6. Evaluation

Before the evaluation started, we created a survey and asked users to select the top three movie features they care about, as mentioned in Section 4.1. After one week of data collection, we obtained the result shown in Figure 4.

Table 1: Comparison of results between different clustering methods in the first evaluation method.

Method   True positive   False positive   Accuracy rate (%)
AF       239             98               70.920
DB       174             77               69.323
HC       228             88               72.152
RA       125             104              54.585

AF: affinity propagation; DB: DBSCAN; HC: hierarchical clustering; RA: random clustering.

Based on the result, we calculate the distance matrix and use it as input to the different clustering methods. We implemented the four clustering techniques using the Python library scikit-learn.

6.1. Experimental Setup. The evaluation was conducted on a university social website in China, which is owned by one of our authors. We added the recommendation function in a banner and put it on top of the movie service to let users vote. An overview of the voting component of the website is shown in Figure 10.

There are three voting choices for each recommended movie: like, do not like, and unseen. Voting like scores 1, voting do not like scores −1, and voting unseen scores 0. Because of the limited time for data collection, we could not wait for users to watch the recommended movies. Therefore, we added the unseen choice for users to select if they have not seen the recommended movie, and we only consider the like and do not like votes when evaluating our approach.

6.2. Result. After two weeks of voting on the movie recommendations, we had collected 168424 total votes. Of these, 132360 votes were collected in 12 days and used for recommendation by majority voting, as mentioned in Section 3.3. The remaining 6064 votes were collected in two days after the recommendation system was deployed. Since only the like and do not like votes are considered eligible, we generated the results in the following subsections.

6.2.1. First Evaluation Result. In the first evaluation result, listed in Table 1, we calculated the true positives, false positives, and accuracy for each clustering technique. True positive equals the number of like votes, and false positive equals the number of do not like votes. Because most of the recommended movies were voted unseen, a total of 1133 votes were counted, and 216 users participated in the voting. Accuracy is defined as follows:

\[
\text{Accuracy} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{8}
\]
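As a quick check, formula (8) can be applied to the counts reported in Table 1. The snippet below is only an illustration; the dictionary name `table1` is an assumption, while the counts are taken from the table.

```python
def accuracy(true_positive, false_positive):
    """Formula (8): share of like votes among all counted (like + do not like) votes."""
    return true_positive / (true_positive + false_positive)


# (true positive, false positive) counts per method, as reported in Table 1.
table1 = {"AF": (239, 98), "DB": (174, 77), "HC": (228, 88), "RA": (125, 104)}

for method, (tp, fp) in table1.items():
    print(f"{method}: {100 * accuracy(tp, fp):.3f}%")

print(sum(tp + fp for tp, fp in table1.values()))
```

The printed rates agree with the accuracy column of Table 1, and the counts add up to the 1133 votes mentioned in the text.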

As shown by the result in Table 1, the hierarchical clustering approach is the best. However, the accuracy of every clustering approach is not satisfactory. Therefore, we extracted the users’ voting history from our server database and analyzed those data. We found that some of the users made fewer than 5 votes, and some of them made between 5 and 10 votes.

Figure 9: Hierarchical clustering demo (dendrogram; vertical axis: height).

Figure 10: Website overview (each recommended movie is shown with like, dislike, unseen, and details options).

There are a few users who made more than 20 votes. Among those who made more than 20 votes, we found that 4 users made all their votes do not like. Therefore, almost one third of the false positive votes were made by only 4 of the 216 users. Table 3 shows an example from one such user. The first column is the movie ID from IMDb, the second column is the voting result, and the third column is the voting time. We think that this user may dislike our recommendation idea and therefore made all the votes negative.

Another small portion of users made more than 20 votes, most of which are negative; only a few votes are positive, and there are no zero (unseen) votes. Table 4 shows one such example. This is probably because the user voted negative instead of choosing unseen for the unseen movies, since most of the users voted unseen more often than like or do not like.

6.2.2. Second Evaluation Result. The two situations mentioned in Section 6.2.1 affected our result considerably in the first evaluation. Therefore, we made a second evaluation by calculating each user’s accuracy with formula (8) and then computing the average of all users’ accuracy:

\[
\text{Accuracy} = \frac{\sum \text{accuracy}}{\text{Total Number of Users}} \tag{9}
\]
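The difference between pooling all votes (formula (8)) and averaging per-user accuracy (formula (9)) can be seen in a small sketch. The vote data below are invented, and excluding users who only voted unseen is an assumption, since their accuracy under formula (8) is undefined.

```python
from collections import defaultdict


def mean_user_accuracy(votes):
    """Formula (9): average of per-user accuracies.

    votes: iterable of (user_id, vote) pairs with vote in {1, -1, 0}.
    Unseen votes (0) are ignored; users with no like or do not like vote are skipped.
    """
    likes = defaultdict(int)
    counted = defaultdict(int)
    for user, vote in votes:
        if vote == 0:
            continue
        counted[user] += 1
        if vote == 1:
            likes[user] += 1
    if not counted:
        return 0.0
    return sum(likes[u] / counted[u] for u in counted) / len(counted)


# Invented example: user B votes "do not like" on everything.
votes = (
    [("A", 1)] * 3 + [("A", -1)]
    + [("B", -1)] * 20
    + [("C", 1)] * 2 + [("C", 0)] * 5
)
pooled = sum(v == 1 for _, v in votes) / sum(v != 0 for _, v in votes)
print(pooled, mean_user_accuracy(votes))
```

With per-user averaging, a single user who casts many negative votes counts once rather than twenty times, which is why the second evaluation is less sensitive to the two situations described in Section 6.2.1.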

Table 2: Comparison of results between different clustering methods in the second evaluation method.

Method   Accuracy rate (%)
AF       80.95
DB       84.71
HC       82.03
RA       63.51

In the second evaluation, we waited for another 6 hours to collect more data. There are 7100 votes in total, of which 1420 votes are considered, and 242 users participated in the voting. Based on the second evaluation, we obtained a much better result, as shown in Table 2. DBSCAN (density-based clustering), affinity propagation, and hierarchical clustering obtained over 80% accuracy; DBSCAN reached 84.71%.

6.3. Comparison with Related Work. In this section, we compare our result with those of several other movie recommendation approaches/systems. Pomerantz and Dudek [16] used a hierarchical Bayesian approach combined with the item similarity of the content for the movie recommender and achieved 70.8% as the best result for their classifiers. In Biancalana et al.’s paper [11], the best result is 82.4%, obtained by combining three classifiers in a neural network. Basu et al. [17] presented an inductive learning approach to recommendation and achieved an averaged precision of 83%.

Table 3: Example of all negative votes.

imdb        Vote   Last update
tt1716777   −1     2013-12-06 23:36:55
tt0183649   −1     2013-12-06 23:36:56
tt1728196   −1     2013-12-06 23:36:58
tt0014429   −1     2013-12-06 23:36:59
tt0034248   −1     2013-12-06 23:37:01
tt0017136   −1     2013-12-06 23:37:02
tt0035169   −1     2013-12-06 23:37:03
tt0023940   −1     2013-12-06 23:37:04
tt0035209   −1     2013-12-06 23:37:05
tt0047876   −1     2013-12-06 23:37:23
tt0106535   −1     2013-12-06 23:37:24
tt0121164   −1     2013-12-06 23:37:27
tt0006864   −1     2013-12-06 23:37:28
tt0499141   −1     2013-12-06 23:37:30
tt0049010   −1     2013-12-06 23:37:32
tt1298650   −1     2013-12-06 23:40:39
tt0408236   −1     2013-12-06 23:40:40
tt0162661   −1     2013-12-06 23:40:41
tt0349903   −1     2013-12-06 23:40:43
tt0114369   −1     2013-12-06 23:40:44
tt0096895   −1     2013-12-06 23:40:45
tt0111161   −1     2013-12-06 23:40:49
tt2992146   −1     2013-12-06 23:40:53
tt0082971   −1     2013-12-06 23:40:56
tt1075419   −1     2013-12-06 23:40:58

Our best result (84.71%), achieved by DBSCAN [14] as shown in Table 2, improves by 1.71 percentage points on the best result of the related works mentioned above. Meanwhile, all these related works are computed offline; they use stored models when testing, and these models cannot be updated in time, since it takes too long to redo the learning process. Our approach, on the other hand, does not need to store user models, and it updates the recommendation result in real time by preclustering movies and using a very simple learning approach (majority voting) to provide movie recommendations. Since movies are updated much less frequently than users’ votes, our approach is much more time efficient than the related works. To sum up, our research is meaningful in that it realizes real-time model updating without compromising accuracy.

7. Conclusion and Future Work

In this paper, we proposed a new distance function for publish year (year related) to compute the distance matrix and then implemented a real-time movie recommendation system based on content (movie) using preclustering and majority voting. We implemented four different clustering techniques in the preclustering process and obtained 84.71% accuracy as the best result, which is better than some of the other research papers’ approaches. We realized a real-time updating model with no compromise on accuracy. Our approach can work even better with special user groups, such as people in colleges, universities, or professional communities.

Table 4: Example of nonzero votes.

imdb        Vote   Last update
tt0068767   −1     2013-12-06 22:33:30
tt0109958   −1     2013-12-06 22:33:32
tt0047034   −1     2013-12-06 22:33:33
tt0397892   −1     2013-12-06 22:33:35
tt0032551   −1     2013-12-06 22:33:36
tt0110008   1      2013-12-06 22:33:38
tt0416881   −1     2013-12-06 22:33:40
tt1233499   −1     2013-12-06 22:33:42
tt1754691   −1     2013-12-06 22:33:45
tt1034303   −1     2013-12-06 22:33:47
tt0015324   −1     2013-12-06 22:33:48
tt0049408   −1     2013-12-06 22:33:50
tt0033467   −1     2013-12-06 22:33:51
tt0000417   −1     2013-12-06 22:33:54
tt1437358   −1     2013-12-06 22:33:56
tt0063105   −1     2013-12-06 22:33:57
tt0253474   −1     2013-12-06 22:33:58
tt0034248   1      2013-12-06 22:34:00
tt0018578   1      2013-12-06 22:34:04
tt0335345   −1     2013-12-06 22:34:07
tt0028445   1      2013-12-06 22:34:10
tt0040536   1      2013-12-06 22:34:16
tt0332379   −1     2013-12-06 22:34:19
tt1728196   −1     2013-12-06 22:34:21

The next step in this research is to make the system adaptive so that it can be widely used for many other types of content, such as articles, news, and music, and to obtain better accuracy. One of the limitations of our approach is that, if a cluster contains many positive votes as well as many negative ones, we will not recommend movies in this cluster. However, this situation may arise for two totally different reasons. One is that the user does not care about the movies in this cluster. In this case, making no recommendation from this cluster is a wise choice. However, if the user really likes the movies in this cluster, has watched many of them, and likes some while disliking others, ignoring this cluster will dissatisfy the user. To overcome this limitation, we can give weights to the votes from users. If a movie is voted positive by the user and is voted positive by many other people as well, the weight of such a vote is decreased; if the movie is voted positive by the user but is voted negative by most others, the vote reveals the special taste of this user and gains a higher weight, and vice versa. More specifically, we can set the weight of a positive vote as $N/(P + N)$ and the weight of a negative vote as $P/(P + N)$, where $P$ is the total number of positive votes from all users for this movie and $N$ is the total number of negative votes from all users for this movie. In this way, we can still ignore the cluster in the first situation while overcoming the limitation in the second situation.
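The proposed weighting can be written down directly. In the sketch below, `vote_weight` follows the formulas in the text, while `weighted_cluster_score`, which sums the weighted votes of one user over a cluster, is an assumption about how the weights might be used and is not specified in the paper.

```python
def vote_weight(vote, positives, negatives):
    """Weight of one user's vote on a movie with P positive and N negative votes overall.

    Positive vote: N / (P + N). Negative vote: P / (P + N). Unseen (0): no weight.
    """
    total = positives + negatives
    if vote == 0 or total == 0:
        return 0.0
    return negatives / total if vote > 0 else positives / total


def weighted_cluster_score(user_votes, totals):
    """Hypothetical aggregate: sum of vote * weight over the user's votes in a cluster.

    user_votes: {movie_id: vote}; totals: {movie_id: (P, N)}.
    """
    return sum(
        vote * vote_weight(vote, *totals[movie_id])
        for movie_id, vote in user_votes.items()
    )


# A like on a widely liked movie carries little weight; a like on a widely
# disliked movie carries a lot, since it reveals the user's special taste.
print(vote_weight(1, 90, 10), vote_weight(1, 10, 90), vote_weight(-1, 90, 10))
```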

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors especially thank Dr. Khaled Rasheed for providing them with suggestions and information regarding their research. The authors also thank Mr. Hongfei Yan for the related work and discussions.

References

[1] M. A. Ghazanfar and A. Prugel-Bennett, “An improved switching hybrid recommender system using Naive Bayes classifier and collaborative filtering,” in Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS ’10), pp. 493–502, March 2010.

[2] A. Bellogín, I. Cantador, P. Castells, and A. Ortigosa, “Discovering relevant preferences in a personalised recommender system using machine learning techniques,” in Proceedings of the ECML-PKDD Workshop on Preference Learning, 2008.

[3] F.-C. Lin, H.-W. Yu, C.-H. Hsu, and T.-C. Weng, “Recommendation system for localized products in vending machines,” Expert Systems with Applications, vol. 38, no. 8, pp. 9129–9138, 2011.

[4] K.-J. Kim and H. Ahn, “A recommender system using GA K-means clustering in an online shopping market,” Expert Systems with Applications, vol. 34, no. 2, pp. 1200–1209, 2008.

[5] S. B. Aher and L. M. R. J. Lobo, “Combination of machine learning algorithms for recommendation of courses in E-Learning System based on historical data,” Knowledge-Based Systems, vol. 51, pp. 1–14, 2013.

[6] G. M. Dakhel and M. Mahdavi, “Providing an effective collaborative filtering algorithm based on distance measures and neighbors’ voting,” International Journal of Computer Information Systems and Industrial Management Applications, vol. 5, pp. 524–531, 2013.

[7] W. H. Ujwala, V. R. Sheetal, and M. Debajyoti, “A hybrid web recommendation system based on the improved association rule mining algorithm,” Journal of Software Engineering and Applications, vol. 6, article 396, 2013.

[8] I. Tatli and A. Birturk, “A tag-based hybrid music recommendation system using semantic relations and multi-domain information,” in Proceedings of the 11th IEEE International Conference on Data Mining Workshops (ICDMW ’11), pp. 548–554, December 2011.

[9] Q. Yang, J. Fan, J. Wang, and L. Zhou, “Personalizing web page recommendation via collaborative filtering and topic-aware Markov model,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM ’10), pp. 1145–1150, December 2010.

[10] X. Cai, M. Bain, A. Krzywicki et al., “Collaborative filtering for people to people recommendation in social networks,” in AI 2010: Advances in Artificial Intelligence, vol. 6464, pp. 476–485, Springer, Berlin, Germany, 2010.

[11] C. Biancalana, F. Gasparetti, A. Micarelli, A. Miola, and G. Sansonetti, “Context-aware movie recommendation based on signal processing and machine learning,” in Proceedings of the 2nd Challenge on Context-Aware Movie Recommendation, pp. 5–10, October 2011.

[12] P. Jaccard, “The distribution of the flora in the alpine zone. 1,” The New Phytologist, vol. 11, no. 2, pp. 37–50, 1912.

[13] B. J. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, no. 5814, pp. 972–976, 2007.

[14] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD ’96), vol. 96, pp. 226–231, 1996.

[15] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Upper Saddle River, NJ, USA, 1988.

[16] D. Pomerantz and G. Dudek, “Context dependent movie recommendations using a hierarchical Bayesian model,” in Advances in Artificial Intelligence, vol. 5549 of Lecture Notes in Computer Science, pp. 98–109, Springer, Berlin, Germany, 2009.

[17] C. Basu, H. Hirsh, and W. Cohen, “Recommendation as classification: using social and content-based information in recommendation,” in Proceedings of the 15th National Conference on Artificial Intelligence (AAAI ’98), pp. 714–720, July 1998.

[18] “2.3. Clustering - scikit-learn 0.14 documentation,” http://scikit-learn.org/stable/modules/clustering.html#clustering.


Page 6: Research Article Precomputed Clustering for Movie ...downloads.hindawi.com/journals/jam/2014/742341.pdfResearch Article Precomputed Clustering for Movie Recommendation System in Real

6 Journal of Applied Mathematics

25

20

15

10

05

00

minus05

minus10

minus15

minus20minus20 2015100500minus05minus10minus15minus25

Figure 8 DBSCAN demo [18]

Due to this rather generic view clusters found by DBSCANcan be in any shape as opposed to k-means which assumesthat clusters are convex shapedThe central component to theDBSCAN is the concept of core samples which are samplesthat are in areas of high density A cluster is therefore a setof core samples each close to each other (measured by somedistancemeasure) and a set of non-core samples that are closeto a core sample (but are not themselves core samples) asdemonstrated in Figure 8

513 Hierarchical Clustering Hierarchical clustering [15] is ageneral family of clustering algorithms that build nested clus-ters bymerging them successivelyThis hierarchy of clusters isrepresented as a treeThe root of the tree is the unique clusterthat gathers all the samples the leaves being the clusters withonly one sample

The beauty of this clustering method is that we have noneed to provide the size of clustering ahead Instead we caninitialize with a large cut height and decrease the cut heightwhen the user has voted most movies in one cluster In thisway we can provide a larger cluster to fit userrsquos interests Asshown in Figure 9

514 Random Clustering We first fix the size of clustering to60 and then for each movie we randomly pick the cluster itbelongs to by generating a random number between 1 and 60

52 Preclustering System Output Because of seeking effi-ciency we output the clustering label results in two formatsOne can be used to check clustering labels by movie ID andanother is used to check all the moviesrsquo IDs in one clustergiven the cluster label By outputting the file in JSON we caneasily transfer data between preclustering system in Pythonand real time online recommendation system in PHP

6 Evaluation

Before the evaluation starts we created a survey and askedusers to select the top three movie features they care about as

Table 1 Comparison of results between different clustering meth-ods in first evaluation method

True positive False positive Accuracy rateAF 239 98 70920DB 174 77 69323HC 228 88 72152RA 125 104 54585

mentioned in Section 41 After one week of data collectionwe have the result which is shown in Figure 4 as mentioned

Based on the result we calculate the distance matrix andmake it as an input to the different clusters We implementedthe four clustering techniques by using python library Scikit-learn

61 Experimental Setup The evaluation has been conductedusing a university social website in China since the website isowned by one of our authors We add the recommendationfunction in a banner and put it on top of the movie service tolet users vote The overview voting component of the websiteis showed in Figure 10

There are three voting choices like do not like andunseen for each recommended movie Voting like scores 1voting do not like scores minus1 and voting unseen scores 0Because of the limited time of data collection we cannot waitfor users to watch the recommended movies Therefore weadd the unseen choices for users to select if they have not seenthe recommended movies and only consider the votes whichare like or do not like to evaluate our approach

62 Result After two weeks voting for themovie recommen-dation we have collected 168424 total votes 132360 votes arecollected in 12 days and used for recommendation majorityvoting as mentioned in Section 33 The remaining 6064votes are collected in two days after the recommendationsystem is deployed Since only the like vote and do not likevote are considered illegible we generated the results in thefollowing subsections

621 First Evaluation Result In the first evaluation resultlisted in Table 1 we calculated the true positive false positiveand accuracy for each clustering techniques True positiveequals the number of like votes and false positive equals thenumber of do not like votes Because most of the recom-mended movies are voted unseen there are a total of 1133votes counted and 216 users participated in the votingConsider the following

Accuracy = True PositiveTrue Positive + False Positive

(8)

As shown by the result in Table 1 the hierarchical clus-tering approach is the best However the accuracy of everyclustering approach is not satisfied Therefore we extractedthe usersrsquo voting history from our server database andanalyzed those data We found that some of the users madeless than 5 votes some of them made between 5 to 10 votes

Journal of Applied Mathematics 7

08

06

04

02

00

015

Hei

ght

X31

X37

X0 X20

X28

X38

X11

X35 X3 X29

X10

X36

X7 X24 X6 X3

0X2

5X5 X21

X19

X1 X13 X18

X14

X17

X15

X26 X12

X33 X8

X2X2

2X2

3 X32

X9 X27 X4 X34

X16

Figure 9 Hierarchical clustering demo

Beijing institute of technology

IMBD rating 71IMBD rating 71IMBD rating 71

IMBD rating 71 IMBD rating 85 IMBD rating 80

Isabella

Add

Voting

Bookmarkfind us on Weibo Blog Notifica-

tionsMessages Tasks Profiles logout

Forum Home Blog Search Support

like dislike unseen detailslike dislike unseen detailslike dislike unseen detailslike dislike unseen detailslike dislike unseen detailslike dislike unseen details

Figure 10 Website overview

There are a few users who made more than 20 votes In thosepeople who made more than 20 votes we found that 4 usersmade all their votes do not like Therefore almost one thirdof the false positive votes were made by only 4 users over 216users Table 3 shows one example from one typical user Thefirst column is the movie ID from imdb the second columnis the voting results and the third column is the voting timeWe think maybe this user dislikes our recommendation ideaso he made all the votes negative

There is another small portion of users made more than20 votes Most of the votes are negative Only a few votesare positive and no zero or unseen votes Table 4 shows oneof such examples It is probably because this user is votingnegative instead of the unseen choices for the unseenmoviessincemost of the users were voting unseenmore than like anddo not like

622 Second Evaluation Result These two situations men-tioned in Section 621 effect our result a lot in the first eval-uationTherefore wemade a second evaluation by calculatingeach userrsquos accuracy in formula (8) and then computing theaverage of all userrsquos accuracy Consider the following

Accuracy =sum accuracy

Total Number ofUsers (9)

Table 2 Comparison of results between different clustering meth-ods in second evaluation method

Accuracy rateAF 8095DB 8471HC 8203RA 6351

In the second evaluation we wait for another 6 hours tocollect more data There are 7100 votes in total 1420 votesare considered and 242 users participated in voting Basedon the second evaluation we obtained much better result asshown in Table 2 DBSCAN (density-based clustering) affin-ity propagation and hierarchical clustering obtained over80 accuracy DBSCAN reached to 8471

6.3. Comparison with Related Work. In this section, we compare our result with those of several other movie recommendation approaches/systems. Pomerantz and Dudek [16] used a hierarchical Bayesian approach combined with the item similarity of the content for the movie recommender and reported 70.8% as the best classifier result. In Biancalana et al.'s paper [11], the best result is 82.4%, obtained by combining three classifiers in a neural network. Basu et al. [17] presented an inductive learning approach to recommendation and achieved an average precision of 83%.

Table 3: Example of all negative votes.

imdb        Vote   Last update
tt1716777   -1     2013-12-06 23:36:55
tt0183649   -1     2013-12-06 23:36:56
tt1728196   -1     2013-12-06 23:36:58
tt0014429   -1     2013-12-06 23:36:59
tt0034248   -1     2013-12-06 23:37:01
tt0017136   -1     2013-12-06 23:37:02
tt0035169   -1     2013-12-06 23:37:03
tt0023940   -1     2013-12-06 23:37:04
tt0035209   -1     2013-12-06 23:37:05
tt0047876   -1     2013-12-06 23:37:23
tt0106535   -1     2013-12-06 23:37:24
tt0121164   -1     2013-12-06 23:37:27
tt0006864   -1     2013-12-06 23:37:28
tt0499141   -1     2013-12-06 23:37:30
tt0049010   -1     2013-12-06 23:37:32
tt1298650   -1     2013-12-06 23:40:39
tt0408236   -1     2013-12-06 23:40:40
tt0162661   -1     2013-12-06 23:40:41
tt0349903   -1     2013-12-06 23:40:43
tt0114369   -1     2013-12-06 23:40:44
tt0096895   -1     2013-12-06 23:40:45
tt0111161   -1     2013-12-06 23:40:49
tt2992146   -1     2013-12-06 23:40:53
tt0082971   -1     2013-12-06 23:40:56
tt1075419   -1     2013-12-06 23:40:58

Our best result (84.71%), achieved by DBSCAN [14] as shown in Table 2, improves by 1.71 percentage points on the best result of the related works mentioned above. Meanwhile, all these related works are computed offline and use stored models when testing; these models cannot be updated in time, since it takes too long to redo the learning process. Our approach, on the other hand, does not need to store user models, and it updates the recommendation result in real time by preclustering movies and using a very simple learning approach (majority voting) to provide movie recommendations. Since movies are updated much less frequently than users' votes, our approach is much more time efficient than the related works. To sum up, our research is meaningful in that it achieves real-time model updating without compromising accuracy.
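The division of work described above (clusters computed offline, votes applied online) can be illustrated with a short sketch. It is not the authors' implementation: the exact majority-voting rule is assumed here to be "recommend unseen movies from clusters where the user's likes outnumber dislikes", and the names `recommend`, `cluster_of`, `user_votes`, and `top_n` are hypothetical.

```python
from collections import defaultdict

def recommend(cluster_of, user_votes, top_n=5):
    """Majority voting over precomputed movie clusters (illustrative sketch).

    cluster_of: dict movie_id -> cluster label, computed offline.
    user_votes: dict movie_id -> +1 (like) or -1 (do not like).
    ASSUMPTION: a cluster is recommended when its vote sum is positive.
    """
    score = defaultdict(int)
    for movie, vote in user_votes.items():
        if movie in cluster_of:
            score[cluster_of[movie]] += vote
    liked_clusters = {c for c, s in score.items() if s > 0}
    # Only movies the user has not voted on are candidates.
    candidates = [m for m, c in cluster_of.items()
                  if c in liked_clusters and m not in user_votes]
    return sorted(candidates)[:top_n]  # sorting is only for determinism

cluster_of = {"m1": 0, "m2": 0, "m3": 0, "m4": 1, "m5": 1}
user_votes = {"m1": 1, "m2": 1, "m4": -1}
print(recommend(cluster_of, user_votes))  # ['m3']
```

Each new vote only changes a per-cluster counter, so the recommendation can be refreshed immediately without rerunning the clustering step.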

7. Conclusion and Future Work

In this paper, we proposed a new distance function for publish year (year related) to compute the distance matrix and then implemented a real-time movie recommendation system based on content (movie) using preclustering and majority voting. We implemented four different clustering techniques in the preclustering process and obtained 84.71% accuracy as the best result, which is better than some of the approaches in other research papers. We realized a real-time updating model with no compromise on accuracy. Our approach may work even better with special user groups, such as people in colleges, universities, or professional communities.

Table 4: Example of nonzero votes.

imdb        Vote   Last update
tt0068767   -1     2013-12-06 22:33:30
tt0109958   -1     2013-12-06 22:33:32
tt0047034   -1     2013-12-06 22:33:33
tt0397892   -1     2013-12-06 22:33:35
tt0032551   -1     2013-12-06 22:33:36
tt0110008    1     2013-12-06 22:33:38
tt0416881   -1     2013-12-06 22:33:40
tt1233499   -1     2013-12-06 22:33:42
tt1754691   -1     2013-12-06 22:33:45
tt1034303   -1     2013-12-06 22:33:47
tt0015324   -1     2013-12-06 22:33:48
tt0049408   -1     2013-12-06 22:33:50
tt0033467   -1     2013-12-06 22:33:51
tt0000417   -1     2013-12-06 22:33:54
tt1437358   -1     2013-12-06 22:33:56
tt0063105   -1     2013-12-06 22:33:57
tt0253474   -1     2013-12-06 22:33:58
tt0034248    1     2013-12-06 22:34:00
tt0018578    1     2013-12-06 22:34:04
tt0335345   -1     2013-12-06 22:34:07
tt0028445    1     2013-12-06 22:34:10
tt0040536    1     2013-12-06 22:34:16
tt0332379   -1     2013-12-06 22:34:19
tt1728196   -1     2013-12-06 22:34:21

The next step in this research is to make the system adaptable to many other types of items, such as articles, news, and music, and to obtain better accuracy. One of the limitations of our approach is that if a cluster contains many positive votes as well as many negative ones, we will not recommend movies from this cluster. However, this situation may arise for two totally different reasons. One is that the user does not care about the movies in this cluster. In this case, making no recommendation from this cluster is a wise choice. However, if the user really likes the movies in this cluster, has watched many of them, and likes some while disliking others, ignoring this cluster will dissatisfy the user. To overcome this limitation, we can give weights to the users' votes. If a movie is voted positive by the user and is voted positive by many other people as well, the weight of such a vote is decreased; if the movie is voted positive by the user but negative by most others, the vote reveals this user's special taste and gains a higher weight, and vice versa. More specifically, we can set the weight of a positive vote as $N/(P + N)$ and the weight of a negative vote as $P/(P + N)$, where $P$ is the total number of positive votes from all users for this movie and $N$ is the total number of negative votes from all users for this movie. In this way, we can still ignore the cluster in the first situation while overcoming the limitation in the second situation.
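The proposed weighting can be written as a small function. This is a minimal sketch of the formulas $N/(P + N)$ and $P/(P + N)$ stated above; the function name `vote_weight`, the treatment of an unseen (zero) vote as weight 0, and the error raised for a movie with no votes are assumptions made for illustration.

```python
def vote_weight(vote, positives, negatives):
    """Weight of one user's vote on a movie.

    vote: +1 (like), -1 (do not like), or 0 (unseen).
    positives (P), negatives (N): vote totals from all users for the movie.
    """
    total = positives + negatives
    if total == 0:
        raise ValueError("movie has no votes")  # ASSUMPTION: undefined case
    if vote > 0:
        return negatives / total  # N / (P + N)
    if vote < 0:
        return positives / total  # P / (P + N)
    return 0.0  # ASSUMPTION: unseen votes carry no weight

# A like on a widely liked movie counts little; a like against the crowd counts a lot.
print(vote_weight(+1, positives=90, negatives=10))  # 0.1
print(vote_weight(+1, positives=10, negatives=90))  # 0.9
print(vote_weight(-1, positives=90, negatives=10))  # 0.9
```

With these weights, a cluster in which the user's likes go against the majority can still accumulate a positive weighted score even when the raw counts of likes and dislikes are balanced.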

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors especially thank Dr. Khaled Rasheed for providing them with suggestions and information regarding their research. The authors also thank Mr. Hongfei Yan for the related work and discussions.

References

[1] M. A. Ghazanfar and A. Prugel-Bennett, “An improved switching hybrid recommender system using Naive Bayes classifier and collaborative filtering,” in Proceedings of the International MultiConference of Engineers and Computer Scientists (IMECS ’10), pp. 493–502, March 2010.

[2] A. Bellogín, I. Cantador, P. Castells, and A. Ortigosa, “Discovering relevant preferences in a personalised recommender system using machine learning techniques,” in Proceedings of the ECML-PKDD Workshop on Preference Learning, 2008.

[3] F.-C. Lin, H.-W. Yu, C.-H. Hsu, and T.-C. Weng, “Recommendation system for localized products in vending machines,” Expert Systems with Applications, vol. 38, no. 8, pp. 9129–9138, 2011.

[4] K.-J. Kim and H. Ahn, “A recommender system using GA K-means clustering in an online shopping market,” Expert Systems with Applications, vol. 34, no. 2, pp. 1200–1209, 2008.

[5] S. B. Aher and L. M. R. J. Lobo, “Combination of machine learning algorithms for recommendation of courses in E-Learning System based on historical data,” Knowledge-Based Systems, vol. 51, pp. 1–14, 2013.

[6] G. M. Dakhel and M. Mahdavi, “Providing an effective collaborative filtering algorithm based on distance measures and neighbors’ voting,” International Journal of Computer Information Systems and Industrial Management Applications, vol. 5, pp. 524–531, 2013.

[7] W. H. Ujwala, V. R. Sheetal, and M. Debajyoti, “A hybrid web recommendation system based on the improved association rule mining algorithm,” Journal of Software Engineering and Applications, vol. 6, article 396, 2013.

[8] I. Tatli and A. Birturk, “A tag-based hybrid music recommendation system using semantic relations and multi-domain information,” in Proceedings of the 11th IEEE International Conference on Data Mining Workshops (ICDMW ’11), pp. 548–554, December 2011.

[9] Q. Yang, J. Fan, J. Wang, and L. Zhou, “Personalizing web page recommendation via collaborative filtering and topic-aware Markov model,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM ’10), pp. 1145–1150, December 2010.

[10] X. Cai, M. Bain, A. Krzywicki et al., “Collaborative filtering for people to people recommendation in social networks,” in AI 2010: Advances in Artificial Intelligence, vol. 6464, pp. 476–485, Springer, Berlin, Germany, 2010.

[11] C. Biancalana, F. Gasparetti, A. Micarelli, A. Miola, and G. Sansonetti, “Context-aware movie recommendation based on signal processing and machine learning,” in Proceedings of the 2nd Challenge on Context-Aware Movie Recommendation, pp. 5–10, October 2011.

[12] P. Jaccard, “The distribution of the flora in the alpine zone. 1,” The New Phytologist, vol. 11, no. 2, pp. 37–50, 1912.

[13] B. J. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, no. 5814, pp. 972–976, 2007.

[14] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD ’96), vol. 96, pp. 226–231, 1996.

[15] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Upper Saddle River, NJ, USA, 1988.

[16] D. Pomerantz and G. Dudek, “Context dependent movie recommendations using a hierarchical Bayesian model,” in Advances in Artificial Intelligence, vol. 5549 of Lecture Notes in Computer Science, pp. 98–109, Springer, Berlin, Germany, 2009.

[17] C. Basu, H. Hirsh, and W. Cohen, “Recommendation as classification: using social and content-based information in recommendation,” in Proceedings of the 15th National Conference on Artificial Intelligence (AAAI ’98), pp. 714–720, July 1998.

[18] “2.3. Clustering - scikit-learn 0.14 documentation,” http://scikit-learn.org/stable/modules/clustering.html#clustering.



Figure 9: Hierarchical clustering demo (dendrogram; vertical axis: Height).

Figure 10: Website overview.

There are a few users who made more than 20 votes. Among those who made more than 20 votes, we found that 4 users cast all their votes as “do not like.” Therefore, almost one-third of the false positive votes were made by only 4 of the 216 users. Table 3 shows an example from one typical such user. The first column is the movie ID from IMDb, the second column is the voting result, and the third column is the voting time. We suspect that this user disliked our recommendation idea and therefore made all the votes negative.

Another small portion of users made more than 20 votes, most of which were negative, with only a few positive votes and no zero, or unseen, votes. Table 4 shows one such example. This is probably because the user voted negative instead of choosing “unseen” for unseen movies, since most users voted “unseen” more often than “like” or “do not like.”


Page 8: Research Article Precomputed Clustering for Movie ...downloads.hindawi.com/journals/jam/2014/742341.pdfResearch Article Precomputed Clustering for Movie Recommendation System in Real

8 Journal of Applied Mathematics

Table 3 Example of all negative votes

imdb Vote Last updatett1716777 minus1 2013-12-06 233655tt0183649 minus1 2013-12-06 233656tt1728196 minus1 2013-12-06 233658tt0014429 minus1 2013-12-06 233659tt0034248 minus1 2013-12-06 233701tt0017136 minus1 2013-12-06 233702tt0035169 minus1 2013-12-06 233703tt0023940 minus1 2013-12-06 233704tt0035209 minus1 2013-12-06 233705tt0047876 minus1 2013-12-06 233723tt0106535 minus1 2013-12-06 233724tt0121164 minus1 2013-12-06 233727tt0006864 minus1 2013-12-06 233728tt0499141 minus1 2013-12-06 233730tt0049010 minus1 2013-12-06 233732tt1298650 minus1 2013-12-06 234039tt0408236 minus1 2013-12-06 234040tt0162661 minus1 2013-12-06 234041tt0349903 minus1 2013-12-06 234043tt0114369 minus1 2013-12-06 234044tt0096895 minus1 2013-12-06 234045tt0111161 minus1 2013-12-06 234049tt2992146 minus1 2013-12-06 234053tt0082971 minus1 2013-12-06 234056tt1075419 minus1 2013-12-06 234058

[17] presented an inductive learning approach to recommen-dation and achieved an averaged precision of 83

Our best result (8471) achieved by DBSCAN [14] asshown in Table 2 improves 171 compared to best result ofrelated works mentioned above Meanwhile all these relatedworks are computed offline and they use the stored modelswhen testing and these models cannot be updated in timesince it takes too long to redo the learning process Our ap-proach on the other hand does not need to store usermodelsand it updates the recommendation result in real time bypreclustering movies and using a very simple learning ap-proach (majority voting) to provide movie recommendationSince movies are updated much less frequently than usersrsquovotes our approach is much more time efficient than relatedworks To sum up our research becomes significantly mean-ingful as it realizes the real timemodel updating while it doesnot compromise the accuracy

7 Conclusion and Future Work

In this paper we proposed a newdistance function for publishyear (year related) to compute the distance matrix and thenimplemented a real time movie recommendation systembased on content (movie) using preclustering and majorityvoting We implemented four different clustering techniquesin the preclustering process and obtained 8471 accuracy

Table 4 Example of nonzero votes

imdb Vote Last updatett0068767 minus1 2013-12-06 223330tt0109958 minus1 2013-12-06 223332tt0047034 minus1 2013-12-06 223333tt0397892 minus1 2013-12-06 223335tt0032551 minus1 2013-12-06 223336tt0110008 1 2013-12-06 223338tt0416881 minus1 2013-12-06 223340tt1233499 minus1 2013-12-06 223342tt1754691 minus1 2013-12-06 223345tt1034303 minus1 2013-12-06 223347tt0015324 minus1 2013-12-06 223348tt0049408 minus1 2013-12-06 223350tt0033467 minus1 2013-12-06 223351tt0000417 minus1 2013-12-06 223354tt1437358 minus1 2013-12-06 223356tt0063105 minus1 2013-12-06 223357tt0253474 minus1 2013-12-06 223358tt0034248 1 2013-12-06 223400tt0018578 1 2013-12-06 223404tt0335345 minus1 2013-12-06 223407tt0028445 1 2013-12-06 223410tt0040536 1 2013-12-06 223416tt0332379 minus1 2013-12-06 223419tt1728196 minus1 2013-12-06 223421

as the best result which is better than some of the otherresearch papersrsquo approachesWe realized a real time updatingmodel with no compromises on accuracy Our approach caneven work better with special user groups such as people incolleges universities or professional communities

The next step in this research is to make the systemadaptive to bewidely used bymany other types of groups suchas articles news and music and to obtain better accuracyOne of the limitations in our approach is that if there isa cluster that contains a lot of positive votes as well asa lot of negative ones we will not recommend movies inthis cluster However this situation may happen becauseof two totally different reasons One is that the user doesnot care about the movies in this cluster In this case norecommendation in this cluster is a wise choice Howeverif the user really likes the movies in this cluster and heshewatched a lot of movies in this cluster and some moviesheshe likes while some heshe dislikes ignoring this clusterin such situation will dissatisfy the user To conquer thislimitation we can give weights to the votes from users Inthe movie which is voted positive by the user and is votedpositive by many other people as well the weight of suchvote will be decreased while if the movie is voted positiveby the user but is voted negative by most of others thismeans such vote reveals the special taste from this user andgains higher weight and vice versa In a more specific waywe can set the weight of positive as 119873(119875 + 119873) and theweight of negative as 119875(119875 + 119873) where 119875 is the total positive

Journal of Applied Mathematics 9

votes from all users towards this movie and 119873 is the totalnegative votes from all users towards this movie In this waywe can still ignore the cluster in first situation while over-coming the limitation we had in the second situation

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

Acknowledgments

The authors especially thank Dr Khaled Rasheed for provid-ing them with suggestions and information regarding theirresearch The authors also thank Mr Hongfei Yan for the re-lated work and discussions

References

[1] M A Ghazanfar and A Prugel-Bennett ldquoAn improved switch-ing hybrid recommender system using Naive Bayes classifierand collaborative filteringrdquo in Proceedings of the InternationalMultiConference of Engineers and Computer Scientists (IMECSrsquo10) pp 493ndash502 March 2010

[2] A. Bellogín, I. Cantador, P. Castells, and A. Ortigosa, “Discovering relevant preferences in a personalised recommender system using machine learning techniques,” in Proceedings of the ECML-PKDD Workshop on Preference Learning, 2008.

[3] F.-C. Lin, H.-W. Yu, C.-H. Hsu, and T.-C. Weng, “Recommendation system for localized products in vending machines,” Expert Systems with Applications, vol. 38, no. 8, pp. 9129–9138, 2011.

[4] K.-J. Kim and H. Ahn, “A recommender system using GA K-means clustering in an online shopping market,” Expert Systems with Applications, vol. 34, no. 2, pp. 1200–1209, 2008.

[5] S. B. Aher and L. M. R. J. Lobo, “Combination of machine learning algorithms for recommendation of courses in E-Learning System based on historical data,” Knowledge-Based Systems, vol. 51, pp. 1–14, 2013.

[6] G. M. Dakhel and M. Mahdavi, “Providing an effective collaborative filtering algorithm based on distance measures and neighbors’ voting,” International Journal of Computer Information Systems and Industrial Management Applications, vol. 5, pp. 524–531, 2013.

[7] W. H. Ujwala, V. R. Sheetal, and M. Debajyoti, “A hybrid web recommendation system based on the improved association rule mining algorithm,” Journal of Software Engineering and Applications, vol. 6, article 396, 2013.

[8] I. Tatli and A. Birturk, “A tag-based hybrid music recommendation system using semantic relations and multi-domain information,” in Proceedings of the 11th IEEE International Conference on Data Mining Workshops (ICDMW ’11), pp. 548–554, December 2011.

[9] Q. Yang, J. Fan, J. Wang, and L. Zhou, “Personalizing web page recommendation via collaborative filtering and topic-aware Markov model,” in Proceedings of the 10th IEEE International Conference on Data Mining (ICDM ’10), pp. 1145–1150, December 2010.

[10] X. Cai, M. Bain, A. Krzywicki, et al., “Collaborative filtering for people to people recommendation in social networks,” in AI 2010: Advances in Artificial Intelligence, vol. 6464, pp. 476–485, Springer, Berlin, Germany, 2010.

[11] C. Biancalana, F. Gasparetti, A. Micarelli, A. Miola, and G. Sansonetti, “Context-aware movie recommendation based on signal processing and machine learning,” in Proceedings of the 2nd Challenge on Context-Aware Movie Recommendation, pp. 5–10, October 2011.

[12] P. Jaccard, “The distribution of the flora in the alpine zone. 1,” The New Phytologist, vol. 11, no. 2, pp. 37–50, 1912.

[13] B. J. Frey and D. Dueck, “Clustering by passing messages between data points,” Science, vol. 315, no. 5814, pp. 972–976, 2007.

[14] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD ’96), vol. 96, pp. 226–231, 1996.

[15] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Upper Saddle River, NJ, USA, 1988.

[16] D. Pomerantz and G. Dudek, “Context dependent movie recommendations using a hierarchical Bayesian model,” in Advances in Artificial Intelligence, vol. 5549 of Lecture Notes in Computer Science, pp. 98–109, Springer, Berlin, Germany, 2009.

[17] C. Basu, H. Hirsh, and W. Cohen, “Recommendation as classification: using social and content-based information in recommendation,” in Proceedings of the 15th National Conference on Artificial Intelligence (AAAI ’98), pp. 714–720, July 1998.

[18] “2.3. Clustering - scikit-learn 0.14 documentation,” http://scikit-learn.org/stable/modules/clustering.html#clustering.



votes from all users towards this movie and N is the total number of negative votes from all users towards this movie. In this way, we can still ignore the cluster in the first situation while overcoming the limitation we had in the second situation.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors especially thank Dr. Khaled Rasheed for providing them with suggestions and information regarding their research. The authors also thank Mr. Hongfei Yan for the related work and discussions.

References

[1] M A Ghazanfar and A Prugel-Bennett ldquoAn improved switch-ing hybrid recommender system using Naive Bayes classifierand collaborative filteringrdquo in Proceedings of the InternationalMultiConference of Engineers and Computer Scientists (IMECSrsquo10) pp 493ndash502 March 2010

[2] A Bellogın I Cantador P Castells and A Ortigosa ldquoDis-covering relevant preferences in a personalised recommendersystemusingmachine learning techniquesrdquo inProceedings of theECML-PKDDWorkshop on Preference Learning 2008

[3] F-C LinH-WYu C-HHsu andT-CWeng ldquoRecommenda-tion system for localized products in vendingmachinesrdquo ExpertSystems with Applications vol 38 no 8 pp 9129ndash9138 2011

[4] K-J Kim and H Ahn ldquoA recommender system using GA K-means clustering in an online shoppingmarketrdquo Expert Systemswith Applications vol 34 no 2 pp 1200ndash1209 2008

[5] S B Aher and LM R J Lobo ldquoCombination ofmachine learn-ing algorithms for recommendation of courses in E-LearningSystem based on historical datardquoKnowledge-Based Systems vol51 pp 1ndash14 2013

[6] G M Dakhel and M Mahdavi ldquoProviding an effective col-laborative filtering algorithm based on distance measures andneighborsrsquo votingrdquo International Journal of Computer Informa-tion Systems and Industrial Management Applications vol 5 pp524ndash531 2013

[7] W H Ujwala V R Sheetal and M Debajyoti ldquoA hybrid webrecommendation system based on the improved associationrule mining algorithmrdquo Journal of Software Engineering andApplications vol 6 article 396 2013

[8] I Tatli and A Birturk ldquoA tag-based hybrid music recom-mendation system using semantic relations and multi-domaininformationrdquo in Proceedings of the 11th IEEE InternationalConference on Data Mining Workshops (ICDMW rsquo11) pp 548ndash554 December 2011

[9] Q Yang J Fan J Wang and L Zhou ldquoPersonalizing web pagerecommendation via collaborative filtering and topic-awareMarkov modelrdquo in Proceedings of the 10th IEEE InternationalConference on Data Mining (ICDM rsquo10) pp 1145ndash1150 Decem-ber 2010

[10] X Cai M Bain A Krzywicki et al ldquoCollaborative filtering forpeople to people recommendation in social networksrdquo in AI2010 Advances in Artificial Intelligence vol 6464 pp 476ndash485Springer Berlin Germany 2010

[11] C Biancalana F Gasparetti A Micarelli A Miola and GSansonetti ldquoContext-aware movie recommendation based onsignal processing and machine learningrdquo in Proceedings of the2nd Challenge on Context-Aware Movie Recommendation pp5ndash10 October 2011

[12] P Jaccard ldquoThe distribution of the flora in the alpine zone 1rdquoThe New Phytologist vol 11 no 2 pp 37ndash50 1912

[13] B J Frey and D Dueck ldquoClustering by passing messages be-tween data pointsrdquo Science vol 315 no 5814 pp 972ndash976 2007

[14] M Ester H-P Kriegel J Sander and X Xu ldquoA density-basedalgorithm for discovering clusters in large spatial databases withnoiserdquo in Proceedings of the 2nd International Conference onKnowledge Discovery and Data Mining (KDD rsquo96) vol 96 pp226ndash231 1996

[15] AK Jain andRCDubesAlgorithms for ClusteringData Pren-tice-Hall Upper Saddle River NJ USA 1988

[16] D Pomerantz and G Dudek ldquoContext dependent movie re-commendations using a hierarchical Bayesian modelrdquo in Ad-vances in artificial intelligence vol 5549 of Lecture Notes inComputer Science pp 98ndash109 Springer Berlin Germany 2009

[17] C Basu H Hirsh and W Cohen ldquoRecommendation as classi-fication using social and content-based information in recom-mendationrdquo in Proceedings of the 15th National Conference onArtificial Intelligence (AAAI rsquo98) pp 714ndash720 July 1998

[18] ldquo23 clusteringmdashscikit-learn 014 documentationrdquo httpscikit-learnorgstablemodulesclusteringhtmlclustering
