

http://www.iaeme.com/IJMET/index.asp

International Journal of Mechanical Engineering and Technology (IJMET)

Volume 10, Issue 02, February 2019, pp. 489–500, Article ID: IJMET_10_02_050

Available online at http://www.iaeme.com/ijmet/issues.asp?JType=IJMET&VType=10&IType=2

ISSN Print: 0976-6340 and ISSN Online: 0976-6359

© IAEME Publication Scopus Indexed

MOVIE REVIEW ANALYSIS AND PREDICTION USING MODIFIED COLLABORATIVE FILTERING AND CLUSTERING WITH REGRESSION

J. Sangeetha

Assistant Professor, Cauvery College for Women, Trichy, Tamilnadu, India

Dr.V. Sinthu Janita Prakash

Head, Department of Computer Science,

Cauvery College for Women, Trichy, Tamilnadu, India

ABSTRACT

The Internet has become the most popular environment for browsing, which increases the volume of service-oriented data. As data volume grows, finding and retrieving the most similar data from it becomes more difficult. Various research methods address this problem by clustering large volumes of data. An existing method, the Clustering-based Collaborative Filtering approach (ClubCF), clusters similar data together so that retrieval time is reduced considerably. However, existing methods cannot find similar reviews accurately, which an efficient and accurate recommendation system requires. The proposed method addresses this by introducing a technique named Modified Collaborative Filtering and Clustering with Regression (MoCFCR). In this method, the k-means algorithm is first used to cluster similar movie reviewers together, which simplifies the recommendation process. To handle the large volume of data, this work adopts the MapReduce framework, which divides the data into subsets that are assigned to separate nodes with individual key values. After clustering, the clustered outcomes are merged using an inverted index procedure in which the similarity between movies is calculated. Collaborative filtering is then applied to remove movies that are not relevant to the input. Finally, movie recommendations are made using logistic regression. The proposed method is evaluated in Hadoop, and the results indicate that it provides better outcomes than the existing techniques.


Key words: Clustering, Search retrieval, Semantic similarity, Partitions, map reduce.

Cite this Article: J. Sangeetha and Dr. V. Sinthu Janita Prakash, Movie Review

Analysis and Prediction Using Modified Collaborative Filtering and Clustering with

Regression, International Journal of Mechanical Engineering and Technology 10(2),

2019, pp. 489–500.

http://www.iaeme.com/IJMET/issues.asp?JType=IJMET&VType=10&IType=2

1. INTRODUCTION

Recommender systems, or recommendation systems, are a subclass of information filtering systems that seek to predict the "rating" or "preference" that a user would give to an item [Ricci, F et al., 2011]. Recommender systems have become extremely common in recent years and are used in a variety of areas. Popular applications include movies, music, news, books, research articles, search queries, social tags, and products in general. There are also recommender systems for experts, collaborators, jokes, restaurants, garments, financial services, and life insurance.

Recommender systems typically produce a list of recommendations in one of two ways: through collaborative filtering or through content-based filtering. A recommender system helps address the information overload problem by retrieving the information a user wants based on the preferences and interests of that user or of similar users. Figure 1 shows the phases of a recommendation system [Isinkaye, F. O., et al 2015].

Figure 1 Phases of recommendation system

Collaborative Filtering (CF) is a technique commonly used to build personalized recommendations on the Web. Popular websites that make use of collaborative filtering include Amazon, Netflix, iTunes, IMDB, LastFM, Delicious and StumbleUpon. In collaborative filtering, algorithms make automatic predictions about a user's interests by compiling preferences from several users [Khorasani, E. S., et al 2016] [Yang, Z., et al 2016].

The main goal of the proposed research method is to handle large datasets effectively and to find similar items, so that recommendations for end users can be made easily. This work is carried out on a movie dataset, with the aim of recommending movies to users based on their preferences. This is achieved by introducing a technique named Modified Collaborative Filtering and Clustering with Regression (MoCFCR).

[Figure 1 block labels: Information collection phase; Learning phase; Recommendation phase; Feedback]


The rest of the paper is organized as follows. This section has given an introduction to collaborative filtering systems. Section 2 discusses related research methodologies in detail based on their working procedures. Section 3 describes the proposed technique with suitable examples and explanations. Section 4 presents the performance evaluation of the proposed technique. Finally, Section 5 gives the overall conclusion based on the simulation outcomes.

2. RELATED WORKS

Big data mining has emerged as an innovative and promising research area for retrieving useful data from huge datasets. It is used in real-time applications such as social site data processing and biomedical applications to handle massive volumes of data, which are usually huge, sparse, incomplete, uncertain, complex or dynamic and come from multiple autonomous sources.

[Sangeetha, J., & Prakash, V. S. J 2017] survey big data mining techniques, data slicing techniques and clustering techniques, and discuss the advantages and drawbacks of each. [Saraswathi, S., & Sheela, M. I. 2014] carried out a clustering analysis of various data mining techniques. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups. Clustering is one of the more complicated tasks in data mining. It plays a vital role in a broad range of applications such as marketing, surveillance, fraud detection, image processing, document classification and scientific discovery. Many issues related to cluster analysis, such as high dimensionality of the dataset, arbitrary cluster shapes, scalability, input parameters, complexity and noisy data, are still under research.

[Reed, J. W., et al 2004] described and analysed a multi-agent system for clustering large data sets. The technique is compared with hierarchical agglomerative clustering using a small set of text data. The results show that the agent-based approach can significantly reduce the time required to cluster large data sets. [Menéndez, H. D. 2013] built on Genetic Graph-based Clustering (GGC) to improve memory usage while maintaining the quality of the solution. The new algorithm, called Multi-Objective Genetic Graph-based Clustering (MOGGC), uses an evolutionary approach that introduces a Multi-Objective Genetic Algorithm to manage a reduced version of the Similarity Graph. The experimental validation shows that MOGGC increases memory efficiency while maintaining and improving the GGC results on the synthetic and real datasets used in the experiments.

[Skabar, A., & Abdalgader, K. 2013] presented a novel fuzzy clustering algorithm that operates on relational input data, i.e., data in the form of a square matrix of pairwise similarities between data objects. The algorithm uses a graph representation of the data and operates in an Expectation-Maximization framework in which the graph centrality of an object is interpreted as a likelihood. [Saâdaoui, F., et al 2015] introduced a new set of strategies for handling quantitative and qualitative data simultaneously. The principle of this approach is to project the qualitative variables onto the subspaces spanned by the quantitative ones. Subsequently, an optimal model is allocated to the resulting PCA-regressed subspaces.

[Prabhu, S. B., & Sophia, S. 2011] gave a crisp introduction to the clustering process in wireless sensor networks (WSNs). They survey different distributed clustering algorithms (adaptive clustering algorithms) used in WSNs based on metrics such as cluster count, cluster stability, cluster head mobility, cluster head role, clustering objective and cluster head selection. The study concludes with a comparison of a few distributed clustering algorithms in WSNs based on these metrics. In [Yamashita, A. et al 2011], item-based collaborative filtering was proposed to improve recommendation accuracy. The unifying approach uses a constant value as the weight parameter that unifies both algorithms. However, because the optimal weight differs from situation to situation, the algorithm should estimate an appropriate weight dynamically and use it.

In [Pham, M. C., et al 2011], a clustering approach based on the social information of users is proposed to derive recommendations. The authors study this approach in two application scenarios: academic venue recommendation based on collaboration information, and trust-based recommendation. Using data from the DBLP digital library and Epinion, their evaluation shows that the clustering-based CF technique performs better than traditional CF algorithms. In [Kesemen, O., et al 2016], the fuzzy c-means algorithm was adapted for directional data. The main benefit of the resulting method, FCM4DD, is that it is effectively a distribution-free approach to clustering directional data.

[Wu, J., et al 2013] presented a neighborhood-based collaborative filtering approach to predict unknown QoS values for QoS-based selection. Compared with existing methods, the proposed method has three new features: 1) an adjusted-cosine-based similarity calculation to remove the impact of different QoS scales; 2) a data smoothing process to improve prediction accuracy; and 3) a similarity fusion approach to handle the data sparsity problem.

3. MOVIE REVIEW ANALYSIS AND RECOMMENDATION SYSTEM

The main goal of the proposed research method is to handle large datasets effectively and to find similar items, so that recommendations for end users can be made easily. This work is carried out on a movie dataset, with the aim of recommending movies to users based on their preferences. This is achieved by introducing a technique named Modified Collaborative Filtering and Clustering with Regression (MoCFCR). In this method, the k-means algorithm is first used to cluster similar movie reviewers together, which simplifies the recommendation process. To handle the large volume of data, this work adopts the MapReduce framework, which divides the data into subsets that are assigned to separate nodes with individual key values. After clustering, the clustered outcomes are merged using an inverted index procedure in which the similarity between movies is calculated. Collaborative filtering is then applied to remove movies that are not relevant to the input. Finally, movie recommendations are made using logistic regression. The overall flow of the proposed research method is shown in Figure 1.

Figure 1 gives the movie review analysis framework. First, a large movie review dataset is collected. To handle this volume of data, the dataset is divided into multiple partitions, which are assigned key values for further processing. The data are then clustered and filtered based on similarity to make the recommendation process efficient. Finally, regression is applied to produce the recommendations.


[Figure 1 block labels: Movie review dataset; Mapper 1, Mapper 2, ..., Mapper n, each with "Remove repeated items" and "K means clustering"; Reducer 1, ..., Reducer n, each followed by "Filtering"; User interest; Similarity finding; Regression based recommendation]

Figure 1 Overall view of the proposed system

3.1. Clustering Using K Means Algorithm

Data clustering is the partitioning of objects into groups (called clusters) such that the similarity between objects in the same group is maximized and the similarity between objects in different groups is minimized. In other words, the goal of a clustering technique is to partition a data set into groups such that both intra-group similarity and inter-group dissimilarity are maximized. Every clustering algorithm is based on some distance measure, which determines how similar two objects are and therefore which objects are grouped together. Different distance measures define the degree of similarity between two objects in different ways. The K-Means algorithm uses Euclidean distance to measure the distortion between a data object and its cluster centroid. In this work, the Euclidean distance metric is taken to be sufficient to group similar data instances. K-Means clustering is one of the most commonly used and effective methods for classifying data because of its simplicity and its ability to handle voluminous data sets.

3.1.1. MapReduce Programming

MapReduce has two separate steps, Map and Reduce. Each step operates on sets of (key, value) pairs. Program execution is therefore divided into a Map stage and a Reduce stage, separated by data transfer between nodes in the cluster. The Mapper function takes data values as input, applies a function to each value in the given dataset and generates an output set in the form of (key, value) pairs. The framework then sorts the Mapper outputs and feeds them to the Reducers; this is the data transfer between the Mappers and the Reducers. The values for a given key are combined at the node running the Reducer for that key. The Reduce stage produces another set of (key, value) pairs as the final output, and it can start only after all data from the Map stage has been received. MapReduce requires input that can be divided into (key, value) pairs and is therefore limited to tasks and algorithms that can be expressed with (key, value) pairs.
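
The Map, sort and Reduce stages described above can be illustrated with a small single-process sketch. It is not Hadoop code: the function names, the record layout (user_id, movie_id, rating) and the sample ratings are assumptions made purely for illustration.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_phase(pairs):
    """Group values by key and sort by key, standing in for the framework's sort step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups, reducer):
    """Apply the reducer to each (key, values) group to build the final output."""
    return [reducer(key, values) for key, values in groups]

# Hypothetical input records: (user_id, movie_id, rating).
ratings = [(1, "m1", 4), (2, "m1", 5), (1, "m2", 3), (3, "m2", 1)]

def rating_mapper(record):
    _user, movie, rating = record
    return [(movie, rating)]          # key = movie, value = rating

def mean_reducer(movie, values):
    return (movie, sum(values) / len(values))

result = reduce_phase(shuffle_phase(map_phase(ratings, rating_mapper)), mean_reducer)
```

Here every rating is mapped to a (movie, rating) pair, the pairs are grouped by movie, and the reducer averages the ratings of each movie. In a real cluster the grouping step is where data moves between nodes.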

3.1.2. K-Means Clustering

Clustering is the process of grouping similar objects. A good cluster should exhibit two main properties: low inter-class similarity and high intra-class similarity. Clustering techniques are used to group a large number of items into clusters that share some similarity. Clustering is a way to discover hierarchy and order in large or hard-to-understand datasets, and in that way to reveal interesting patterns or make the data set easier to comprehend. Cluster analysis is used in many applications, such as image processing and data analysis. K-Means is an unsupervised learning method that belongs to the partition-based clustering methods. It partitions a given dataset of n data objects into k clusters, where k is the desired number of clusters. The K-Means algorithm gives good results only when the initial partition is close to the final solution. The K-Means clustering algorithm follows the steps below.

i) Choose the number of desired clusters, k.

ii) Choose k starting points to be used as initial estimates of the cluster centroids; these are the initial starting values.

iii) Examine each point in the dataset and assign it to the cluster whose centroid is nearest to it.

iv) When every point has been assigned to a cluster, recalculate the k centroids.

v) Repeat steps iii and iv until no point changes its cluster assignment, or until a maximum number of passes through the data set has been performed.
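
The five steps above can be sketched in plain Python as follows. This is a minimal single-machine illustration, not the implementation used in this work; the function names, the `seed` and `max_passes` parameters, the choice to keep the old centroid when a cluster becomes empty, and the sample points are all assumptions.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def kmeans(points, k, max_passes=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step ii: k starting points
    assignment = [None] * len(points)
    for _ in range(max_passes):                  # step v: bounded number of passes
        # step iii: assign each point to the cluster with the nearest centroid
        new_assignment = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:         # step v: no point changed cluster
            break
        assignment = new_assignment
        # step iv: recalculate each centroid as the mean of its members
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                          # assumption: keep old centroid if empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment

# Hypothetical, well-separated sample data.
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (20.0, 20.0), (20.0, 22.0), (22.0, 20.0)]
centroids, assignment = kmeans(points, k=2)
```

On this sample the first three points end up in one cluster and the last three in the other, whichever two points are drawn as starting centroids.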

3.1.3. K Means Clustering Using MapReduce

The proposed method uses the k-means clustering algorithm in the Hadoop framework to cluster datasets of different dimensionality and calculates the SSE value for the resulting clusters. The k-means algorithm is one of the most effective algorithms for clustering. Clustering accuracy is assessed with the SSE value: when the SSE is small, the clusters in the given dataset are compact. Implementing the clustering algorithm on the MapReduce framework also means that users can apply the algorithm to large datasets.

Sum of Squared Error (SSE): The k-means clustering algorithm implemented in the MapReduce paradigm is based on the Euclidean distance. The accuracy of the resulting clusters is measured by the SSE, computed over the points of a cluster as

\[ SSE = \sum_{i} \left( (x_i - x_c)^2 + (y_i - y_c)^2 \right) \]

where

x_i is the x-coordinate of a point in the cluster,

x_c is the x-coordinate of the centroid,

y_i is the y-coordinate of a point in the cluster,

y_c is the y-coordinate of the centroid.
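
Reading the definitions above as a sum of squared Euclidean distances between each point and the centroid of its cluster, the SSE over all clusters can be computed as in the sketch below. The function name, the argument layout and the sample values are assumptions for illustration.

```python
def sse(points, centroids, assignment):
    """Sum of squared Euclidean distances from each 2-D point to its cluster centroid."""
    total = 0.0
    for (x_i, y_i), cluster in zip(points, assignment):
        x_c, y_c = centroids[cluster]
        total += (x_i - x_c) ** 2 + (y_i - y_c) ** 2
    return total

# Hypothetical example: two clusters of two points each.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
centroids = [(1.0, 0.0), (10.0, 11.0)]
assignment = [0, 0, 1, 1]
value = sse(points, centroids, assignment)
```

Each of the four points lies at distance 1 from its centroid, so the SSE here is 4.0. A smaller value for the same data and the same k indicates more compact clusters.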


Pseudocode - K-means Cluster Algorithm

Let n be the number of clusters you want
Let S be the set of feature vectors (|S| is the size of the set)
Let A be the set of associated clusters for each feature vector
Let sim(x,y) be the similarity function (larger means more similar)
Let c[n] be the vectors for our clusters

Init:
    Let S' = S
    //choose n random vectors to start our clusters
    for i = 1 to n
        j = rand(|S'|)
        c[i] = S'[j]
        S' = S' - {c[i]} //remove that vector from S' so we can't choose it again
    end
    //assign initial clusters
    for i = 1 to |S|
        A[i] = argmax(j = 1 to n) { sim(S[i], c[j]) }
    end

Run:
    Let change = true
    while change
        change = false //assume there is no change
        //reassign feature vectors to clusters
        for i = 1 to |S|
            a = argmax(j = 1 to n) { sim(S[i], c[j]) }
            if a != A[i]
                A[i] = a
                change = true //a vector changed affiliations -- so we need to
                              //recompute our cluster vectors and run again
            end
        end
        //recalculate cluster locations if a change occurred
        if change
            for i = 1 to n
                mean, count = 0
                for j = 1 to |S|
                    if A[j] == i
                        mean = mean + S[j]
                        count = count + 1
                    end
                end
                c[i] = mean/count
            end
        end
    end

3.2. Collaborative Filtering

Item-based collaborative filtering algorithms have been widely used in many real-world applications, such as at Amazon.com. The procedure can be divided into three main steps: compute rating similarities, select neighbors and recommend services.

Compute Rating Similarity: Computing the rating similarity between items is a time-consuming but critical step in item-based CF algorithms. Common rating similarity measures include the Pearson correlation coefficient (PCC) and the cosine similarity between rating vectors. The basic intuition behind the PCC measure is to give a high similarity score to two items that tend to be rated the same by many users. PCC, which is the preferred choice in most major systems, was found to perform better than cosine vector similarity. Therefore, PCC is applied to compute the rating similarity between each pair of services in ClubCF.
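
As an illustration of the PCC measure, the sketch below computes the correlation between two items over the users who rated both. The text does not spell out the exact variant used, so taking the means over co-rating users only, returning 0.0 when fewer than two users overlap or when one item has constant ratings, and the dictionary layout (user id to rating) are all assumptions.

```python
import math

def pearson_similarity(ratings_a, ratings_b):
    """PCC between two items; each argument maps user id -> rating (assumed layout)."""
    common = sorted(set(ratings_a) & set(ratings_b))   # users who rated both items
    if len(common) < 2:
        return 0.0                                     # assumption: no evidence, no similarity
    a = [ratings_a[u] for u in common]
    b = [ratings_b[u] for u in common]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0                                     # constant ratings: correlation undefined
    return cov / math.sqrt(var_a * var_b)

# Hypothetical ratings: items a and b rise together, item c moves the opposite way.
item_a = {1: 1, 2: 2, 3: 3}
item_b = {1: 2, 2: 4, 3: 6, 4: 5}
item_c = {1: 3, 2: 2, 3: 1}
sim_ab = pearson_similarity(item_a, item_b)
sim_ac = pearson_similarity(item_a, item_c)
```

Items that users rate consistently in the same direction get a similarity close to 1, and items rated in opposite directions get a similarity close to -1.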

Select Neighbors: Neighbors are selected using a rating similarity threshold 𝛾. The larger the value of 𝛾, the fewer neighbors are chosen, but they tend to be more similar to the target service; the coverage of collaborative filtering therefore decreases while the accuracy may increase. Conversely, the smaller the value of 𝛾, the more neighbors are chosen, but some of them may be only slightly similar to the target service; the coverage therefore increases while the accuracy may decrease. A suitable 𝛾 should be set to balance accuracy and coverage.
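
The effect of the threshold on the neighbor set can be seen in a small sketch. The function name, the similarity values and the two threshold settings are hypothetical.

```python
def select_neighbors(similarities, gamma):
    """Return (item, similarity) pairs whose similarity exceeds gamma, most similar first."""
    chosen = [(item, s) for item, s in similarities.items() if s > gamma]
    return sorted(chosen, key=lambda pair: pair[1], reverse=True)

# Hypothetical similarities between a target service and four candidates.
sims = {"s1": 0.9, "s2": 0.55, "s3": 0.2, "s4": -0.4}
strict = select_neighbors(sims, gamma=0.8)   # few, highly similar neighbors
loose = select_neighbors(sims, gamma=0.1)    # more neighbors, some only weakly similar
```

With the higher threshold only `s1` is kept, while the lower threshold also admits `s2` and `s3`, which illustrates the trade-off between accuracy and coverage.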

The collaborative filtering algorithm uses T-dimensional vectors, where T is the number of distinct terms found in the search logs. Each distinct term t in the search logs is assigned a number N(t) from 1 to T. For each URL u in the search logs, a vector U is created whose value in position N(t) is the number of times that a user searched for term t and clicked on that URL; this value may be zero. Another vector Q is created with the same dimension as U. It contains a 1 in position N(t) if the term t is in the seed set and 0 otherwise.

Each dimension of the vectors above may be multiplicatively weighted by a factor log(n/UF), where n is the number of distinct URLs in the search logs and UF is the number of distinct URLs that users clicked on after searching for the term corresponding to the entry. This may be thought of as an Inverse URL Frequency, analogous to the Inverse Document Frequency weighting used in information retrieval algorithms.
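
A minimal sketch of the log(n/UF) weight follows. The function name, the input layout (term mapped to the set of URLs clicked after searching for it) and the sample terms are assumptions; the natural logarithm is used because the text does not specify a base.

```python
import math

def inverse_url_frequency(clicked_urls_by_term, n_urls):
    """Return log(n / UF) per term, where UF is the number of distinct clicked URLs."""
    return {
        term: math.log(n_urls / len(urls))
        for term, urls in clicked_urls_by_term.items()
        if urls                                  # skip terms with no clicks (UF = 0)
    }

# Hypothetical search log with n = 4 distinct URLs.
clicks = {"comedy": {"u1", "u2", "u3", "u4"}, "noir": {"u2"}}
weights = inverse_url_frequency(clicks, n_urls=4)
```

A term that leads to clicks on every URL gets weight 0, while a term associated with a single URL gets the largest weight, which mirrors the behaviour of inverse document frequency.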

We then compute T = ∑ 1(U) cos(U, Q), where the sum is over the vectors U of all URLs in the search logs and 1(U) is an indicator vector: 1(U) has the same length as U and contains 1 for every non-zero entry of U and 0 otherwise. Note that here T denotes a vector of term scores, not the number of distinct terms. The indices of T are then ranked from the largest entry to the smallest, so that the terms corresponding to these indices are ranked from most similar to least similar. The same collaborative filtering algorithm was applied to the terms and URLs in the advertiser database. The term vector in this case consists of all the terms associated with a URL.
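
The scoring and ranking step can be sketched as below, without the optional Inverse URL Frequency weighting. The helper names, the four-term vocabulary and the click counts are hypothetical, and term positions are 0-based here rather than 1 to T.

```python
import math

def cosine(u, q):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, q))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0

def term_scores(url_vectors, seed_vector):
    """Sum over URLs of indicator(U) * cos(U, Q), giving one score per term position."""
    scores = [0.0] * len(seed_vector)
    for u in url_vectors:
        c = cosine(u, seed_vector)
        for idx, count in enumerate(u):
            if count != 0:                   # indicator 1(U): non-zero entries only
                scores[idx] += c
    return scores

def rank_terms(scores):
    """Term positions ordered from the largest score to the smallest."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

# Hypothetical data: term 0 is the only seed term.
seed = [1, 0, 0, 0]
url_vectors = [[2, 1, 0, 0], [0, 0, 3, 0], [1, 0, 0, 1]]
ranking = rank_terms(term_scores(url_vectors, seed))
```

Terms that co-occur on URLs strongly associated with the seed terms rise to the top of the ranking, while a term that only appears on an unrelated URL (position 2 here) comes last.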

3.3. Logistic Regression Based Recommendation

The recommendation step uses a logistic regression model with lexical features and features from search logs. Logistic regression [Kleinbaum, D. G., & Klein, M. 2010] is a widely used generalized linear model for predicting probabilities. In the logistic regression model, the logit of the probability of relevance of a term is modeled as a linear function of the feature values. The model is trained using maximum likelihood.
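
To make the two statements above concrete, the sketch below writes the logit as a linear function of the features and evaluates the log-likelihood that training maximizes. The function names, the explicit bias term and the toy examples are assumptions; no particular optimizer is implied.

```python
import math

def predict_probability(weights, bias, features):
    """Logistic model: logit(p) = bias + sum_k w_k * x_k, so p = 1 / (1 + exp(-logit))."""
    logit = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))

def log_likelihood(weights, bias, examples):
    """Log-likelihood of labelled examples [(features, label)], with label in {0, 1}."""
    total = 0.0
    for features, label in examples:
        p = predict_probability(weights, bias, features)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total

# Hypothetical one-feature data: positive feature values go with label 1.
examples = [([1.0], 1), ([-1.0], 0)]
ll_untrained = log_likelihood([0.0], 0.0, examples)
ll_better = log_likelihood([2.0], 0.0, examples)
```

Maximum likelihood training searches for the weights that make this quantity as large as possible; here the weight 2.0 explains the two examples better than the weight 0.0.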

To simplify the problem, this work focuses on predicting whether a user will watch a movie based on the other movies they have watched. Thus, for a given input pair (u, m) of user u and movie m, we want to predict whether the user will (1) or will not (0) watch the movie. Because a logistic regression model predicts the probability of the interaction in addition to a binary label, the predicted probabilities can be used to rank the movies for each user and recommend a fixed number of top-ranked movies.
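
The ranking step can be sketched as follows. The function name and the sample probabilities are hypothetical, and excluding movies the user has already watched is an assumption that the text does not state.

```python
def recommend_top_n(probabilities, watched, n):
    """Return the n unwatched movies with the highest predicted watch probability."""
    candidates = [(movie, p) for movie, p in probabilities.items() if movie not in watched]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [movie for movie, _ in candidates[:n]]

# Hypothetical predicted probabilities for one user.
probabilities = {"m1": 0.9, "m2": 0.2, "m3": 0.7, "m4": 0.4}
top = recommend_top_n(probabilities, watched={"m1"}, n=2)
```

For this user the movie `m1` is skipped because it has already been watched, and the two remaining movies with the highest probabilities are recommended.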

Suppose there are N users, M movies, and K features per movie. For each user, we can define a feature vector with MK entries. When we want to predict the interaction for a pair (u, m) of user u and movie m, only the features with indices in [mK, (m+1)K) will be "on" (have non-zero values). The other (M−1)K features are set to zero. This enables us to pack M separate logistic regression models into a single logistic regression model. In our case, for each movie m, we use the interactions with the M−1 other movies as the features. Thus, K = M−1 and the vector has M(M−1) entries. This immediately raises a concern about memory usage: the number of features scales quadratically with the number of movies. In fact, this is one reason why one would not want to train a separate model for each movie in the first place.
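
The block layout of this feature vector can be illustrated with a small dense sketch. The function name, the 0-based movie ids and the order of the other movies inside a block are assumptions.

```python
def block_feature_vector(m, watched, num_movies):
    """Dense vector of M*(M-1) entries for the pair (u, m).

    Only the block [m*K, (m+1)*K), with K = M - 1, can be non-zero. Inside the block,
    position `offset` refers to the offset-th movie other than m (assumed ordering).
    """
    K = num_movies - 1
    vec = [0] * (num_movies * K)
    others = [i for i in range(num_movies) if i != m]
    for offset, i in enumerate(others):
        if i in watched:
            vec[m * K + offset] = 1
    return vec

# Hypothetical case: M = 4 movies, target movie m = 1, user has watched movies 0 and 3.
vec = block_feature_vector(m=1, watched={0, 3}, num_movies=4)
nonzero = [idx for idx, value in enumerate(vec) if value]
```

With M = 4 the vector already has 12 entries, and all non-zero entries fall inside the block [3, 6) that belongs to movie 1. The quadratic growth of the vector with M is what motivates a more compact encoding.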

This is where feature hashing helps. The interactions are very sparse: only a small percentage of the M−1 features for each movie are likely to be on. Instead of a vector of M(M−1) entries for each user-movie pair (u, m), we can create a small vector of 2^B entries, where B is a new parameter of the model. For each movie i ≠ m that the user has watched, we encode the ids m and i as a string s (e.g., 23432_768). By hashing the string, we find the index in the feature vector, idx = hash(s) % 2**B, and increment the count at that index. We implemented a proof of concept using scikit-learn. We used the SGDClassifier with an L2 penalty and log loss. Features for each user-movie pair were hashed with the FeatureHasher extractor. The model was trained in an online fashion, with a batch formed for each user from that user's positive and negative examples. We released the implementation on GitHub under the Apache v2 License.

The overall flow of the proposed research method is given in the following algorithm.
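
Before turning to the overall algorithm, the feature hashing step described above can be sketched in a few lines. This is not the scikit-learn FeatureHasher and not the released implementation: the function names are hypothetical, and a truncated SHA-256 digest is used as a stand-in hash only because Python's built-in `hash` for strings is not stable across processes.

```python
import hashlib

def stable_hash(s):
    """Deterministic non-negative integer hash of a string (stand-in hash function)."""
    return int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:8], "big")

def hashed_features(m, watched, B):
    """Vector of 2**B counts for the pair (u, m), built from the movies the user watched."""
    vec = [0] * (2 ** B)
    for i in watched:
        if i == m:
            continue                      # only movies i != m contribute features
        s = f"{m}_{i}"                    # e.g. "23432_768"
        vec[stable_hash(s) % 2 ** B] += 1
    return vec

# Hypothetical user who watched movies 768, 12 and the target movie itself.
vec = hashed_features(23432, [768, 12, 23432], B=4)
```

The vector length depends only on B, not on the number of movies. Two different strings may hash to the same index, which is the price paid for the fixed size.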

Algorithm: Overall flow of the proposed research method

1. Initial

(i) Split the given input data set into sub-datasets. Form the sub-datasets into <Key, Value> lists and input these lists into the map function.

(ii) Select k points randomly from the datasets as the initial cluster centroids.

2. Mapper

(i) Update the cluster centroids. Calculate the distance between each point in the given datasets and the k centroids.

(ii) Assign each data point to the nearest cluster until all the data have been processed.

(iii) Output the <ai, zj> pair, where ai is the center of the cluster zj.

3. Collaborative Filtering

(i) Compute R_sim(st, sj) using the Pearson correlation coefficient if st and sj belong to the same cluster, and compute R_sim'(st, sj) by weighting R_sim(st, sj).

(ii) Select the services whose enhanced rating similarity with st exceeds a rating similarity threshold 𝛾, and put them into a neighbours set.

(iii) Model the logit of the probability of relevance of a term as a linear function of the feature values.

(iv) Train the model using maximum likelihood.

(v) Compute the predicted rating of st for an active user. If the predicted rating exceeds a recommending threshold, st is recommended to the active user.

4. Reducer

(i) Read <ai, zj> from the Map stage and collect all the data records. Then output the k clusters and their data points.

(ii) Calculate the average of each cluster, which is selected as the new cluster center.

(iii) Calculate the distance between the new centroids and the original centroids in the same cluster. If the value is smaller than the threshold, or the number of iterations has reached the maximum, the algorithm stops. Otherwise, the new cluster centroids are used to update the original centroids, and the algorithm returns to the map stage and continues until convergence.
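
The Mapper and Reducer steps of the algorithm can be sketched as one k-means iteration in plain Python. This is a single-process illustration rather than Hadoop code: the function names are hypothetical, the cluster index is used as the key, and measuring convergence by the largest centroid movement is an assumption.

```python
import math
from collections import defaultdict

def kmeans_mapper(point, centroids):
    """Emit (index of the nearest centroid, point) for one data point."""
    nearest = min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))
    return nearest, point

def kmeans_reducer(points):
    """Average the points of one cluster to obtain its new centroid."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans_iteration(points, centroids):
    """One map/reduce pass: returns the new centroids and the largest centroid shift."""
    groups = defaultdict(list)
    for p in points:                              # map stage
        key, value = kmeans_mapper(p, centroids)
        groups[key].append(value)                 # grouping by key (shuffle)
    new_centroids = list(centroids)
    for key, members in groups.items():           # reduce stage
        new_centroids[key] = kmeans_reducer(members)
    shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
    return new_centroids, shift

# Hypothetical data and starting centroids.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (10.0, 12.0)]
first, shift1 = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
second, shift2 = kmeans_iteration(points, first)
```

A driver would repeat `kmeans_iteration` until the shift falls below a threshold or a maximum number of iterations is reached. In this example the centroids move by 1.0 in the first pass and do not move in the second.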


4. EXPERIMENTAL RESULTS

In this section, the proposed work is evaluated experimentally and its results are compared with those of existing clustering approaches. A movie review data set is used for the experimental analysis: the MovieLens data set, collected by the GroupLens Research Project at the University of Minnesota. This data set consists of:

* 100,000 ratings (1-5) from 943 users on 1682 movies.

* Each user has rated at least 20 movies.

* Simple demographic info for the users (age, gender, occupation, zip)

The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data has been cleaned up: users who had fewer than 20 ratings or did not have complete demographic information were removed from the data set. Detailed descriptions of the data files are given in the README distributed with the data set.

The proposed clustering process is executed on the above-mentioned data set to assess its performance. The proposed methodology is implemented in Hadoop and evaluated in terms of several performance measures. The existing algorithms against which the proposed methodology is compared are k-means, k-medoid, density-based clustering and the Clustering-based Collaborative Filtering approach (ClubCF). The performance metrics used for the comparison are as follows:

* Computation time

* Mean absolute error

These performance measures are evaluated for both the proposed and the existing methodologies to analyze the performance improvement. The comparison is discussed in depth in the following subsections.

4.1. Computation Time

Computation time (also called "running time") is the length of time required to perform a computational process. Representing a computation as a sequence of rule applications, the computation time is proportional to the number of rule applications.

Figure 2. Computation time comparison


Figure 2 compares the computation time of the existing and the proposed research techniques for a varying number of clusters. The comparison indicates that the proposed research technique performs recommendation with less computation time than the existing research techniques.

4.2. Mean Absolute Error

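In statistics, mean absolute error (MAE) is a measure of the difference between two continuous variables. The mean absolute error is given by:

$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - x_i|$

where $y_i$ is the predicted value, $x_i$ is the observed value, and $n$ is the number of predictions. It is possible to express MAE as the sum of two components: Quantity Disagreement and Allocation Disagreement. Quantity Disagreement is the absolute value of the Mean Error.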
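As a small illustration of the metric, MAE can be computed as the average of the absolute differences between predicted and observed ratings. The sketch below is a plain-Python example; the function name `mean_absolute_error` and the sample ratings are hypothetical and are not taken from the experiments reported here.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between paired predicted and observed values."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one prediction is required")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


if __name__ == "__main__":
    # Hypothetical ratings on the 1-5 scale used by the data set.
    predicted_ratings = [4, 3, 5]
    observed_ratings = [5, 3, 3]
    print(mean_absolute_error(predicted_ratings, observed_ratings))
```

A lower value means that the predicted ratings lie closer to the ratings the users actually gave.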

Figure 3. Mean absolute error

Figure 3 compares the mean absolute error of the existing and the proposed research techniques. The comparison indicates that the proposed research method provides a better outcome than the existing research methods.

5. CONCLUSIONS

In this work, the Modified Collaborative Filtering and Clustering with Regression (MoCFCR) method is introduced for accurate movie review classification. In this method, the k-means algorithm is first used to cluster similar movie reviewers together, so that the recommendation process becomes easier. To handle the large volume of data, this work adopts the MapReduce framework, which divides the entire data into subsets that are assigned to separate nodes with individual key values. After clustering, the clustered outcomes are merged using an inverted index procedure in which the similarity between movies is calculated. Collaborative filtering is then applied to remove the movies that are not relevant to the input. Finally, movie recommendations are made accurately using the logistic regression method. The overall evaluation of the proposed method is carried out in Hadoop, and it indicates that the proposed technique provides a better outcome than the existing research techniques. In future work, more attention can be given to the partitioning of data, since data with more dependent values would lead to inaccuracy in prediction. It would be more effective if the correlation between the data were considered in further processing.


REFERENCES

[1] Ricci, F., Rokach, L., & Shapira, B. (2011). Introduction to recommender systems

handbook. In Recommender systems handbook (pp. 1-35). Springer US.

[2] Isinkaye, F. O., Folajimi, Y. O., & Ojokoh, B. A. (2015). Recommendation systems:

Principles, methods and evaluation. Egyptian Informatics Journal, 16(3), 261-273.

[3] Khorasani, E. S., Zhenge, Z., & Champaign, J. (2016, December). A Markov chain

collaborative filtering model for course enrollment recommendations. In Big Data (Big

Data), 2016 IEEE International Conference on (pp. 3484-3490). IEEE.

[4] Yang, Z., Wu, B., Zheng, K., Wang, X., & Lei, L. (2016). A Survey of Collaborative

Filtering-Based Recommender Systems for Mobile Internet Applications. IEEE Access, 4,

3273-3287.

[5] Sangeetha, J., & Prakash, V. S. J. (2017). A Survey on Big Data Mining Techniques.

International Journal of Computer Science and Information Security, 15(1), 482.

[6] Saraswathi, S., & Sheela, M. I. (2014). A comparative study of various clustering

algorithms in data mining. International Journal of Computer Science and Mobile

Computing, 11(11), 422-428.

[7] Reed, J. W., Potok, T. E., & Patton, R. M. (2004, May). A multi-agent system for

distributed cluster analysis. In Proceedings of Third International Workshop on Software

Engineering for Large-Scale Multi-Agent Systems (SELMAS’04) W16L Workshop-26th

International Conference on Software Engineering (pp. 152-155).

[8] Menéndez, H. D., Barrero, D. F., & Camacho, D. (2013, June). A multi-objective genetic

graph-based clustering algorithm with memory optimization. In Evolutionary

Computation (CEC), 2013 IEEE Congress on (pp. 3174-3181). IEEE.

[9] Skabar, A., & Abdalgader, K. (2013). Clustering sentence-level text using a novel fuzzy

relational clustering algorithm. IEEE transactions on knowledge and data engineering,

25(1), 62-75.

[10] Saâdaoui, F., Bertrand, P. R., Boudet, G., Rouffiac, K., Dutheil, F., & Chamoux, A.

(2015). A dimensionally reduced clustering methodology for heterogeneous occupational

medicine data mining. IEEE transactions on nanobioscience, 14(7), 707-715.

[11] Prabhu, S. B., & Sophia, S. (2011). A survey of adaptive distributed clustering algorithms

for wireless sensor networks. International Journal of Computer Science and Engineering

Survey, 2(4), 165.

[12] Yamashita, A., Kawamura, H., & Suzuki, K. (2011). Adaptive fusion method for user-

based and item-based collaborative filtering. Advances in Complex Systems, 14(02), 133-

149.

[13] Pham, M. C., Cao, Y., Klamma, R., & Jarke, M. (2011). A clustering approach for

collaborative filtering recommendation using social network analysis. J. UCS, 17(4), 583-

604.

[14] Kesemen, O., Tezel, Ö., & Özkul, E. (2016). Fuzzy c-means clustering algorithm for

directional data (FCM4DD). Expert Systems with Applications, 58, 76-82.

[15] Wu, J., Chen, L., Feng, Y., Zheng, Z., Zhou, M. C., & Wu, Z. (2013). Predicting quality

of service for selection by neighborhood-based collaborative filtering. IEEE Transactions

on Systems, Man, and Cybernetics: Systems, 43(2), 428-439.

[16] Kleinbaum, D. G., & Klein, M. (2010). Analysis of matched data using logistic regression.

In Logistic regression (pp. 389-428). Springer New York.