Review based Feature Matrix for Predicting ratings in Recommender System

Abstract: Recommender systems are gaining wide popularity and have become an essential component of online business tools because they provide personalized guidance in selecting products and services. Collaborative Filtering, which captures the popularity of an item among the user base, depends heavily on the ratings provided by users, while Content-based Filtering, which matches an item's features to the user's taste, requires content information as well as the user's preference information. Service providers usually invite users to share their experience of a service in the form of reviews and ratings. Reviews are verbose and are a rich source of information about the service's features and the user's preferences, whereas ratings are usually sparse because users are reluctant to quantify their experience. A feature matrix generated by processing review information using semantic similarity based on synsets can be used along with the sparse ratings to generate the complete predicted ratings matrix. The paper presents a modified matrix factorization approach for recommendations using the review based feature matrix.


II. BACKGROUND
Recommender systems (RS) help prospective buyers select a product or service whose features match their preferences, while also taking into account its popularity among other users [3] [4]. They have varied applications, such as purchasing a product (Amazon), listening to a song (Last.fm) or selecting a hotel (TripAdvisor) [2]. RS use three common techniques: Collaborative Filtering (CF), Content-based filtering and, more recently, Hybrid filtering, which combines the two. Collaborative Filtering follows two approaches, neighborhood based and model based. Neighborhood-based methods are either user based or item based. User-based methods use ratings to link a user with a set of like-minded users and recommend to the new user a set of items liked by her/his neighbors; item-based methods recommend items similar to those the user has viewed or purchased before. Model-based CF focuses on learning the latent factors that represent users' inherent preferences over an item's multiple dimensions [5]. Model-based methods perform well when there is sufficient rating information. Most product and service providers collect user feedback in qualitative form as a textual review or in quantified form as ratings. While some users find it easy to rate a service, most users prefer to share their experience of the service as a review. While ratings can be easily manipulated, reviews represent user sentiments in a more reliable manner [6]. Recently several researchers have used the valuable information hidden in reviews to address the rating problems. The most commonly used approach is to identify frequently occurring terms in reviews as indicators of item as well as reviewer characteristics [7]. Another approach is to identify review topics, either by applying a frequency-based approach to extracted terms or phrases [8] or by using a topic modeling approach such as Latent Dirichlet Allocation (LDA) [9].
Opinion mining is another approach, in which the positive or negative sentiment of the user about the item is identified by aggregating the sentiments of all opinion words [10]. Reviews voted helpful by other users can be given higher weightage, thus improving predictions [11]. Matrix factorization methods are used as unsupervised learning methods for latent variable decomposition and dimensionality reduction. They map both users and items to a joint latent factor space of a given dimensionality [12]. A vector $q_i$ is associated with each item $i$ and a vector $p_u$ with each user $u$. $q_i$ measures the degree to which the item possesses the features, and $p_u$ measures the degree to which the user is interested in those features. The dot product $q_i^{T} p_u$ captures the interaction between user $u$ and item $i$, that is, the user's overall interest in the item's characteristics. $r_{ui}$ denotes user $u$'s rating of item $i$, which this dot product approximates. The matrices can be used to compute the recommendation score for any user and item [13] [14] [15]. Machine learning techniques can be used with the known ratings data to generate the parameters that govern the relationship between items and features ($q_i$) and between users and features ($p_u$). In this paper, the review based feature matrix replaces $q_i$, while the available sparse ratings are used to generate the parameters that govern the relationship between the users and the item features.
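The interaction between a user and an item described above can be illustrated with a minimal sketch. The function name `predict_rating` and the three-feature values are hypothetical and chosen only for illustration; they are not taken from the paper's data.

```python
def predict_rating(item_features, user_params):
    """Return the dot product q_i^T p_u of an item vector and a user vector."""
    if len(item_features) != len(user_params):
        raise ValueError("item and user vectors must have the same length")
    return sum(q * p for q, p in zip(item_features, user_params))


# Hypothetical example: one hotel described by three normalized features,
# and one user whose parameters express interest in each feature.
q_hotel = [0.5, 0.25, 0.25]
p_user = [4.0, 2.0, 6.0]
score = predict_rating(q_hotel, p_user)
```

In the approach of this paper the item vector would come from a row of the review based feature matrix, so only the user vector has to be learned from ratings.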

III. REVIEW BASED FEATURE MATRIX GENERATION
The first step in review based recommendation is to generate the feature matrix. The reviews are preprocessed by removing stop words, stemming, etc. [18] [19]. All the terms are identified along with their frequencies, and a term set is generated. Synonyms and meronyms are identified for each term and assembled into groups to form the synsets. The synsets whose group frequency is greater than the threshold frequency are chosen to represent the columns, and the reviews are placed in rows [20]. The synset document matrix containing the group frequency counts is normalized to form the feature matrix.
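The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list, the hand-written synset map `SYNSET_OF` (standing in for a lexical database lookup), the crude plural-stripping stemmer, the aggregation of reviews per hotel, and the row-sum normalization are all hypothetical choices.

```python
import re
from collections import Counter

# Assumption: tiny stop-word list and synset map, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "of", "to", "in"}
SYNSET_OF = {
    "room": "room", "suite": "room", "bed": "room",   # bed: meronym of room
    "staff": "staff", "employee": "staff",
    "pool": "pool",
}


def preprocess(review):
    """Lowercase, tokenize, drop stop words, strip a trailing plural 's'."""
    tokens = re.findall(r"[a-z]+", review.lower())
    terms = [t for t in tokens if t not in STOP_WORDS]
    return [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in terms]


def synset_counts(reviews):
    """Count how often each synset group occurs in a list of reviews."""
    counts = Counter()
    for review in reviews:
        for term in preprocess(review):
            group = SYNSET_OF.get(term)
            if group is not None:
                counts[group] += 1
    return counts


def build_feature_matrix(reviews_by_hotel, threshold):
    """Keep synsets with total frequency above threshold; normalize each row."""
    per_hotel = {h: synset_counts(r) for h, r in reviews_by_hotel.items()}
    totals = Counter()
    for counts in per_hotel.values():
        totals.update(counts)
    features = sorted(g for g, n in totals.items() if n > threshold)
    matrix = {}
    for hotel, counts in per_hotel.items():
        row = [counts[g] for g in features]
        total = sum(row)
        # Assumption: normalize by the row sum so each row adds up to 1.
        matrix[hotel] = [c / total if total else 0.0 for c in row]
    return features, matrix


reviews_by_hotel = {
    "hotel_a": ["The rooms were clean and the bed was comfortable.",
                "Friendly staff, lovely suite."],
    "hotel_b": ["The employees were helpful.",
                "Staff was great, the pool was small."],
}
features, X = build_feature_matrix(reviews_by_hotel, threshold=1)
```

With these toy reviews the "pool" group occurs only once and is dropped by the threshold, while "rooms", "bed" and "suite" all contribute to the single "room" feature.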

A. Data Collection
The dataset used is a tar file containing 12,773 hotel reviews downloaded from the TripAdvisor site [http://sifaka.cs.uiuc.edu/~wang296/Data/index.html]. The data is in JSON format. Each hotel's data file contains approximately 100-300 reviews of that hotel by different users. The dataset contains ratings and reviews, author name, location including country and state, short review1, short review2, author review ID, name of hotel and descriptive review. The Hotel table contains id, name, URL, price, address etc. The difficulty with this dataset is that the reviews are quite extensive while the ratings are very sparse.

B. Feature Matrix generation
Feature Matrix (X) is a number of hotels (nh) by number of features (nf) matrix containing the normalized feature quantification for each hotel and each set of features. Reviews contain terms that describe the features of an item or service. A single feature-indicator term may not be frequent within one review, but it may be used repeatedly across reviews. Different users may not use the same term but may use a synonym or meronym of it. The synset grouping thus helps in bringing together different terms indicating the same feature. As the reviews contained many location-specific and other hierarchies, both synonyms and meronyms were effectively used in forming the synset groups. For a comparative study the feature matrix is generated using two approaches: in the first approach ten reviews per hotel were considered, and in the second approach 40 reviews were considered for each hotel. The feature matrices generated by the first and second approach are shown in Table I and Table II respectively. Table I contains normalized values for 28 features and 10 hotels, and Table II contains normalized values for 34 features and 10 hotels. In the second approach the synset grouping becomes stronger because synonyms of terms are repeated across reviews of the same hotel, thus strengthening the frequency count. The number of features increased from 22 for a single review to 28 for 10 reviews and 34 for 40 reviews, as shown by the various shadings in Table VI.

IV. PREDICTIVE MODELING APPROACH
In this section the predictive approach is presented for recommendations using the feature matrix and the available user ratings. For each user we need to learn the parameter vector θ that governs the relationship between the features and the rating. Rating (Y) is a number of users (nu) by number of hotels (nh) matrix containing the rating given by each user to each hotel. The rating is usually a number between 1 and 5. There may be several missing values, which are indicated by a '?'. Table III shows the sparse rating matrix. For computational purposes an Indicator matrix (R) is used, which is a binary matrix indicating the presence or absence of ratings: R(i, j) = 1 if user i has rated hotel j, and 0 otherwise. Parameter Matrix (θ) is a number of users (nu) by number of features (nf) matrix containing the parameter values that govern the relationship between the features and the ratings. Predicted Ratings (P) is a number of users (nu) by number of hotels (nh) matrix containing the predicted rating for each user and each hotel. Once the parameter values are available, the P matrix can be computed using the following formula:

$$P = \theta X^{T}$$

The gradient descent approach is used for learning the parameters.

A. Gradient Descent Approach
The machine learning algorithm tries to minimize the cost function. The cost is computed as the mean squared error between the predicted and the actual ratings [16][17], taken over the observed ratings only. Up to a constant normalizing factor, the error (J) can be computed using Equation (1):

$$J(\theta) = \sum_{(i,j):\,R(i,j)=1} \left( (\theta X^{T})_{ij} - Y(i,j) \right)^{2} \qquad (1)$$

The gradient descent approach iteratively modifies the θ parameters by stepping against the gradient, which is the partial derivative of the error with respect to each parameter. The gradient (grad) for each parameter θ can be computed using Equation (2). With a suitably small step size the error decreases at each iteration, so that after a finite number of iterations parameters that fit the data well can be obtained.
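The learning procedure described in this section can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' Octave code: the cost is taken as half the summed squared error over rated entries (the scaling factor is an assumption), the step size `alpha`, the iteration count and the toy matrices are hypothetical, and missing ratings are stored as `None` where the text uses '?'.

```python
def predict(theta, X):
    """Return P = theta X^T as rows (users) by columns (hotels)."""
    return [[sum(t * x for t, x in zip(theta_u, x_h)) for x_h in X]
            for theta_u in theta]


def cost(theta, X, Y, R):
    """Half the summed squared error over rated entries only (R[u][h] == 1)."""
    P = predict(theta, X)
    return 0.5 * sum((P[u][h] - Y[u][h]) ** 2
                     for u in range(len(Y))
                     for h in range(len(X))
                     if R[u][h])


def gradient(theta, X, Y, R):
    """Partial derivative of the cost with respect to every theta[u][k]."""
    P = predict(theta, X)
    n_features = len(X[0])
    return [[sum((P[u][h] - Y[u][h]) * X[h][k]
                 for h in range(len(X)) if R[u][h])
             for k in range(n_features)]
            for u in range(len(theta))]


def gradient_descent(X, Y, R, alpha=0.5, iterations=300):
    """Learn theta by repeatedly stepping against the gradient."""
    theta = [[0.0] * len(X[0]) for _ in Y]
    history = []
    for _ in range(iterations):
        history.append(cost(theta, X, Y, R))
        grad = gradient(theta, X, Y, R)
        theta = [[t - alpha * g for t, g in zip(theta_u, grad_u)]
                 for theta_u, grad_u in zip(theta, grad)]
    return theta, history


# Hypothetical toy data: 3 hotels x 2 features, 2 users, None = missing ('?').
X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Y = [[5, 3, None], [None, 2, 3]]
R = [[1, 1, 0], [0, 1, 1]]

theta, history = gradient_descent(X, Y, R)
P = predict(theta, X)  # P[0][2] and P[1][0] fill in the two missing ratings
```

Because the feature matrix X is fixed, each user's parameters are fitted independently of the other users, which is what distinguishes this setting from factorization in which both factors are learned. In this toy example both missing ratings converge to approximately 4.0.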
Choosing the regularization parameter lambda (λ)
To avoid overfitting of the model to the training data, a regularization term governed by lambda (λ) is introduced. The cost function and gradient are modified by using Equation (3). The lambda (λ) parameter is chosen after executing the algorithm for different lambda (λ) values [17]. The learning curve of training and cross-validation error for different values of lambda (λ), for the feature matrix based on ten reviews, is shown in Fig. 1.
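One common way to carry out such a selection is sketched below. The exact form of the regularized cost is not recoverable from the text, so the sketch assumes an L2 penalty of (λ/2) times the sum of squared θ values added to half the summed squared error, which adds λθ to the gradient. The split of ratings into training and cross-validation indicator matrices, the candidate λ values, the step size and the toy data are likewise hypothetical.

```python
def predict_one(theta_u, x_h):
    """Predicted rating of one user for one hotel."""
    return sum(t * x for t, x in zip(theta_u, x_h))


def squared_error(theta, X, Y, R):
    """Half the summed squared error over entries where R is 1 (no penalty)."""
    return 0.5 * sum((predict_one(theta[u], X[h]) - Y[u][h]) ** 2
                     for u in range(len(Y))
                     for h in range(len(X))
                     if R[u][h])


def train(X, Y, R, lam, alpha=0.1, iterations=2000):
    """Gradient descent on the assumed L2-regularized cost."""
    n_features = len(X[0])
    theta = [[0.0] * n_features for _ in Y]
    for _ in range(iterations):
        new_theta = []
        for u, theta_u in enumerate(theta):
            errors = [predict_one(theta_u, X[h]) - Y[u][h] if R[u][h] else 0.0
                      for h in range(len(X))]
            grad = [sum(e * X[h][k] for h, e in enumerate(errors))
                    + lam * theta_u[k]          # contribution of the penalty
                    for k in range(n_features)]
            new_theta.append([t - alpha * g for t, g in zip(theta_u, grad)])
        theta = new_theta
    return theta


def choose_lambda(X, Y, R_train, R_val, candidates):
    """Return the lambda with the lowest cross-validation error."""
    results = {}
    for lam in candidates:
        theta = train(X, Y, R_train, lam)
        results[lam] = (squared_error(theta, X, Y, R_train),
                        squared_error(theta, X, Y, R_val))
    best = min(results, key=lambda lam: results[lam][1])
    return best, results


# Hypothetical toy data: 4 hotels x 2 features, 2 users.
X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.8, 0.2]]
Y = [[5, 3, 4, 5], [4, 2, 3, 4]]
R_train = [[1, 1, 1, 0], [0, 1, 1, 1]]   # ratings used for fitting theta
R_val = [[0, 0, 0, 1], [1, 0, 0, 0]]     # held-out ratings for validation
candidates = [0.0, 0.1, 1.0, 10.0]

best_lam, results = choose_lambda(X, Y, R_train, R_val, candidates)
```

The training error can only grow as λ increases, so it is the cross-validation error, measured without the penalty term, that guides the choice. Plotting both errors in `results` against λ gives a curve of the kind the text refers to.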

B. Predicted Ratings
The matrix factorization algorithm is implemented on the Octave platform, and training is carried out using built-in Octave functions. The gradient descent algorithm was executed using the two feature matrices. Table IV and Table V show the predicted ratings for the feature matrices based on ten reviews and forty reviews respectively. Variations in the predicted ratings can be clearly observed in Table V.

V. CONCLUSION
The paper presents an approach to predict ratings based on reviews when ratings are sparse. Semantic similarity between the terms is used to generate the feature matrix from the reviews. The predicted rating matrix is produced using two types of feature matrices, and the variations in the predicted ratings are presented. The approach is based on modified matrix factorization for recommendations using the feature