A Computational Approach to Find Deceptive Opinions by Using Psycholinguistic Clues

— Product reviews and blogs play a vital role in giving end users the insight needed to make a decision. The direct impact of reviews and ratings on product sales raises a strong possibility of fake reviews. Sellers on e-commerce sites often indulge in writing fake reviews to promote or demote particular products and services. These fictitious opinions, written to sound authentic, are known as deceptive opinion/review spam. Review spam detection has received significant attention in both business and academia due to the potential impact fake reviews can have on consumer behaviour and purchasing decisions. To curb this issue, many e-commerce companies have even started to certify reviewers, but this covers only a small fraction of reviewers, so the technique is not enough to deal with the problem of deceptive opinion spamming. Detecting these deceptive opinions manually is difficult. This work primarily focuses on enhancing the accuracy of existing deceptive opinion spam classifiers using psycholinguistic/sociolinguistic deceptive clues. We have formulated this problem in different ways and solved each formulation with several machine learning techniques. The work was carried out on the publicly available gold-standard corpus of deceptive opinion spam, and the final classifier achieved up to 92 percent cross-validation accuracy in the restaurant domain and around 94 percent in the hotel domain. A detailed comparative analysis of results is given for all machine learning algorithms used.

Word play is deceptive, and so is the human being. Review language plays a major role in identifying hidden intentions. Our main focus in this work is to explore the use of psycholinguistic/sociolinguistic features to study the deceptive behaviour of the reviewer. Linguists and psychologists have done a great deal of work to find verbal and non-verbal clues to deception [2] and to establish associations between psycholinguistic features and deception. However, in our opinion, many of these associations have not been utilized for opinion spam detection. On this basis, we propose an intermediate layer in which we identified various computational psycholinguistic measures/metrics and, based on earlier studies, identified their association with the deceptive behaviour of a person. To achieve our objective, we built different computational metrics and, on a benchmark dataset (opinions classified as spam and non-spam), observed that these measures differ significantly between spam and non-spam reviews. These measures were used as features for training and testing various machine learning models. This work mainly focuses on:
• Formulation of the opinion spam detection problem in different ways: genre identification (informative vs. imaginative writing), linguistic deception detection, and traditional text classification.
• Use of psycholinguistic/sociolinguistic features such as emotion, negativity, tension, anger, personal concerns, tone, etc. to understand the intention of the reviewer.
• Use of readability and lexical diversity as features in the context of opinion spamming. We have observed that these measures can contribute significantly towards detecting deceptive reviews; however, to our knowledge, no preliminary study has been reported on the application of these measures in the opinion spamming domain.
• Use of SVM (support vector machine), SLDA (stabilized linear discriminant analysis) and ensemble learning techniques to detect opinion spam. We have performed experiments on the restaurant and hotel domains of Myle Ott's gold-standard dataset [3]. A comparative study and analysis of each approach and its corresponding results is given.
The rest of the chapter is organized as follows. The second section describes various works related to opinion spamming from different approaches. Section 3 explains feature identification and construction and justifies their use both logically and statistically. Section 4 presents the problem formulations and the classification methodology used in this work for deceptive spam detection. Section 5 contains experimental details along with a statistical analysis of the results. The last section comprises the conclusion and future work.

II. RELATED WORK
The basic definition of spamming refers to web spam, which includes email spam and search engine spam, i.e., the act of misleading search engines into ranking some web pages higher than they deserve [4]. Going beyond this basic definition, spamming also includes opinion spamming, a comparatively new field of research. Although a lot of research is going on in opinion mining and sentiment analysis, only a few studies have focused on the opinion spam problem, and more specifically on deceptive opinion spam detection. Preliminary research was reported on Amazon reviews [5], where the authors re-framed review spam identification as a duplicate review identification problem. Previous attempts at spam/spammer detection have used reviewer behaviour, text similarity, linguistic features, review helpfulness and rating patterns.
One of the finest works in the field of deceptive opinion spam identification integrated psychology and computational linguistics [3]. The authors claimed that the best performance was achieved by using psychological features with a support vector machine (SVM), detecting deceptive spam with accuracy up to 89 percent in the hotel domain. They also contributed a large-scale, publicly available gold-standard dataset for deceptive opinion spam research.
In another approach, the authors proposed a model complementary to existing approaches for finding subtle spamming activities [6]; it can thus be combined with other textual-feature-based models to improve their accuracy. They proposed the novel concept of a heterogeneous review graph, claimed to capture the interrelationships among reviewers, reviews, and the stores the reviewers have reviewed. The model tries to identify suspicious reviewers by exploring nodes of the graph, and attempts to establish the relationship between the trustworthiness of reviewers, the honesty of reviews and the reliability of stores. This work achieved precision of up to 49 percent; however, the authors claimed to identify suspicious spammers that could not be detected by other existing techniques.
As earlier studies suggest, ratings have a high influence on revenue: higher ratings result in higher revenue, and many companies indulge in insidious practices to obtain undue benefits. Unfair and biased rating patterns have been studied in several previous works [7], [8]. In one approach, the authors identified several characteristic behaviours of review spammers and modelled these behaviours to detect spammers [9]. They derived aggregated behaviour scoring methods to rank reviewers according to the degree to which they demonstrate spamming behaviour. Their study shows that removing reviewers with very high spam scores changes the aggregate ratings of highly spammed products and product groups significantly more than removing randomly selected or unrelated reviewers.
Another approach is to capture the general differences in language usage between deceptive and truthful reviews [10]. This model included several domain-independent features that allow formulating general rules for recognizing deceptive opinion spam. The authors used part-of-speech (POS), psychological and other general linguistic cues of deception with SAGE [11] and SVM models. The dataset used in this work covers the hotel, restaurant, and doctor domains. SAGE achieved much better results than SVM, with around 0.65 accuracy in the cross-domain task. Another model, integrating deep linguistic features derived from the syntactic dependency parse tree, was proposed to discriminate deceptive opinions from normal ones [12]. The authors worked on Ott's dataset and a Chinese dataset and claimed state-of-the-art results on both.
Opinion spamming can be done individually or may involve a group [13]. Group spamming can be even more damaging, since a group can take total control of the sentiment on a target product because of its size. That work was based on the assumption that a group of reviewers works together to demote or promote a product. The authors used frequent pattern mining to find candidate spammer groups and used several behavioural models derived from the collusion phenomenon among fake reviews, along with relation models.

III. THEORETICAL FRAMEWORK FOR FEATURE IDENTIFICATION AND CONSTRUCTION
In our work, we have considered various well-defined readability, lexical diversity and psychological features along with n-gram measures. Each of these measures can be used to characterize a review, and these characteristic measures have been used as features of the review. This work is based on the observation that these features help us to distinguish between deceptive and truthful reviews.

A. Readability
The creator of the SMOG readability formula, G. Harry McLaughlin, defines readability as "the degree to which a given class of people find certain reading matter compelling and comprehensible" [14]. It was in 1937 that the US government first decided to grade civilians' reading levels rather than considering them either literate or illiterate. According to the National Center for Educational Statistics (1993), the average US citizen reads at the 7th-grade level, and when it comes to writing it degrades even further. It has been observed that a review written by an average US citizen contains simple, familiar words and usually less jargon compared to one written by a professionally hired spammer. This simplicity and ease of words lead to better readability. In particular, we will test the hypothesis that, all else equal, higher readability is associated with a lower chance of spam.
Various readability metrics have been suggested to measure the readability of text. Among them, we have considered a few well-established readability metrics [14], [15]. Specifically, we computed the Automated Readability Index (ARI), Coleman-Liau Index (CLI), Chall Grade (CG), SMOG, Flesch-Kincaid Grade Level (FKGL) and Linsear (LIN). As a whole, the readability features are referred to as READ throughout this paper. Table 1 below shows the statistical measures for the readability metrics on the restaurant domain with respect to truthful and deceptive opinions. The statistics in Table 1 show a significant difference between truthful and deceptive reviews in ARI (two-tailed t-test p=0.0045), CLI (p=0.03), CG (p=0.02), SMOG (p=0.01), FKGL (p=0.01) and LIN (p=0.03).
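As an illustration, two of these metrics can be computed directly from character, word and sentence counts. The sketch below uses the standard published formulas; the naive tokenization is an assumption for illustration, not the exact implementation used in our experiments:

```python
import re

def ari(text):
    """Automated Readability Index: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    chars = sum(len(w) for w in words)
    return 4.71 * chars / len(words) + 0.5 * len(words) / len(sentences) - 21.43

def cli(text):
    """Coleman-Liau Index: 0.0588*L - 0.296*S - 15.8, where L is letters
    per 100 words and S is sentences per 100 words."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    L = 100.0 * sum(len(w) for w in words) / len(words)
    S = 100.0 * len(sentences) / len(words)
    return 0.0588 * L - 0.296 * S - 15.8
```

Lower scores correspond to simpler, more readable text; each metric is computed per review and contributes one column of the feature matrix.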

B. Lexical Diversity
Lexical diversity is another text characteristic that can be used to distinguish between deceptive and truthful opinions. The more varied the vocabulary of a text, the higher its lexical diversity; for a text to be highly lexically diverse, the writer's word choice needs to be varied, with little repetition of vocabulary. Moreover, previous researchers have shown that lexical diversity is significantly higher in writing than in speaking [16], [17], and according to several studies lexical diversity is genre-sensitive [17]. Search engine optimization (SEO) companies are hired to influence product ratings to the undue benefit of the hiring companies; they write fake reviews to manipulate customers' opinions about particular products. Whether working individually or in a group, an employee writes more than one review to make a significant impact, so these reviews have higher similarity and lower lexical diversity. Moreover, when such writers have to review products or services they are not familiar with, they tend to borrow vocabulary from previously written reviews, which also leads to low lexical diversity. Truth tellers, in contrast, come with fresh ideas, honest opinions and real experience, which leads to higher lexical diversity in comparison to liars.
Numerous metrics for measuring lexical diversity exist, each with its pros and cons. For example, the traditional lexical diversity measure is the ratio of different words (types) to the total number of words (tokens), the so-called type-token ratio, or TTR [18]. Because of its sensitivity to sample size, text samples containing a large number of tokens give a lower TTR and vice versa. The D measure developed by Brian Richards and David Malvern [19] was designed to be independent of sample size, but it too has been criticized for its sensitivity to sample size [20]; another classical measure is Herdan's C, the log TTR (Herdan, 1960). Even as a traditional classifier feature, lexical diversity can play a significant role, and here we tried to find how effective lexical diversity is at identifying deceptive opinion spam. The combination of all of the lexical diversity metrics is referred to as LEX in this paper. Further, we will test the hypothesis that, all else equal, higher lexical diversity is associated with a lower chance of spam. Table 4 below shows various lexical diversity measures for the restaurant domain. The statistics show a significant difference between truthful and deceptive reviews in TTR (two-tailed t-test p=0.0272), CTTR (p=0.0325), MA-TTR (p=0.0288), MS-TTR (p=0.0316), log-TTR (p=0.0173), R (p=0.0334), S (p=0.0247), and U (p=0.005) values.
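Several of these measures can be sketched in a handful of lines. The following is a simplified illustration; the window size and whitespace tokenization are assumptions, not the exact settings used in our experiments:

```python
import math

def ttr(tokens):
    """Type-token ratio: distinct words over total words."""
    return len(set(tokens)) / len(tokens)

def root_ttr(tokens):
    """Guiraud's R: types over the square root of tokens."""
    return len(set(tokens)) / math.sqrt(len(tokens))

def log_ttr(tokens):
    """Herdan's C: log of types over log of tokens."""
    return math.log(len(set(tokens))) / math.log(len(tokens))

def ma_ttr(tokens, window=50):
    """Moving-average TTR: mean TTR over a sliding window, which reduces
    the sensitivity of plain TTR to text length."""
    if len(tokens) <= window:
        return ttr(tokens)
    vals = [ttr(tokens[i:i + window]) for i in range(len(tokens) - window + 1)]
    return sum(vals) / len(vals)
```

A borrowed, repetitive vocabulary drives all of these values down, which is exactly the signal the LEX feature set is meant to capture.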

C. Psychological and linguistic features
It is a well-known fact that lying is undesirable; decent people rarely lie, and this lack of practice makes them poor liars. Falsehoods communicated by mistake are not lies. People lie less often about their actions, experiences and plans, and when they do, they lie in pursuit of material gain or to escape punishment. Deception can be defined as the task of misleading others. People behave in quite different ways when they are lying compared to when they are telling the truth, and practitioners and laypersons have been interested in these differences for centuries [21]. In 1981, Zuckerman, DePaulo, and Rosenthal published the first comprehensive meta-analysis of cues to deception [6]. They reported large differences in the verbal and nonverbal cues that occur in deceptive communications compared with truthful ones. This study shows that liars make a more negative impression and are more tense. Michal Woodworth revealed that liars produce more sense-based words [22]; in other words, deceptive reviewers are more subjective than truthful ones. Liars also use fewer self-oriented words (I, me, mine, we, etc.) but more other-oriented words (you, they, etc.). According to studies on deception, liars offer fewer details than truth tellers, not only because they have less familiarity with the domain but also to allow fewer opportunities to be disproved [23].
To extract psychological features from text reviews, we have used Linguistic Inquiry and Word Count (LIWC) [24]. It is a transparent text analysis program that counts words in psychologically meaningful categories. Empirical results using LIWC (version 2015) demonstrate its ability to detect meaning in a wide variety of experimental settings, including attentional focus, emotionality, social relationships, thinking styles, and individual differences. It is among the most popular text analysis tools in the social sciences. Its output variables are categorized into linguistic processes, psychological processes, personal concerns and spoken categories. We have used its linguistic process (LIWC_ling) and psychological process (LIWC_psy) feature sets; the psychological and linguistic features of LIWC jointly are referred to as LIWC_all in this paper. Table 6 shows a list of a few LIWC features.
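Conceptually, LIWC maps each word to one or more categories and reports the percentage of words in each category. A toy re-implementation of that idea follows; the mini-lexicon is a made-up illustration, not the actual proprietary LIWC dictionary, which contains thousands of words across roughly 90 categories:

```python
# Hypothetical mini-lexicon for illustration only.
LEXICON = {
    "posemo": {"good", "great", "love", "nice"},
    "negemo": {"bad", "awful", "hate", "terrible"},
    "self":   {"i", "me", "my", "mine", "we", "our"},
}

def category_rates(text):
    """Percentage of tokens in the text falling in each lexicon category."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    n = len(tokens)
    return {cat: 100.0 * sum(t in words for t in tokens) / n
            for cat, words in LEXICON.items()}
```

Each category percentage becomes one feature, so a review is reduced to a fixed-length vector regardless of its length.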

D. N-Gram
To capture the context of the review we have used unigrams (UG) and bigrams (BG). Some generic preprocessing, such as removing stop words and extra white space, is done before generating the DTM (document-term matrix). The top UG and BG were filtered based on their term frequency-inverse document frequency (tf-idf) scores. Jointly, UG and BG are referred to as n-grams (NG) in this paper.
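The selection step can be sketched as follows. This is a simplified pure-Python illustration; in practice the DTM was built with standard R text-mining packages, and the exact weighting scheme may differ:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_terms(docs, k):
    """Rank unigrams and bigrams by summed tf-idf across the corpus."""
    tokenized = [d.lower().split() for d in docs]
    term_counts = [Counter(ngrams(t, 1) + ngrams(t, 2)) for t in tokenized]
    df = Counter()                       # document frequency per term
    for tc in term_counts:
        df.update(tc.keys())
    scores = Counter()
    for tc in term_counts:
        for term, tf in tc.items():
            scores[term] += tf * math.log(len(docs) / df[term])
    return [term for term, _ in scores.most_common(k)]
```

Terms appearing in every document get an idf of zero and drop out, so the retained DTM columns are the ones that discriminate between documents.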
IV. PROPOSED WORK
As discussed earlier, this paper primarily focuses on improving the accuracy of opinion spam classifiers by identifying domain-independent linguistic and psycholinguistic features. The proposed work is divided into three subsections. The first focuses on feature identification and construction, explaining their significance for opinion spam detection. The second deals with possible problem formulations and explains the different strategies and the corresponding feature sets used to solve them. The third describes the various classification methods used in this work.

A. Problem Formulation
There are various ways to formulate the problem of detecting opinion spam: it can be identified either using duplicate detection or using classification techniques. Much of the existing literature on opinion spamming has framed opinion spam identification as a duplicate opinion identification problem; however, this assumption is not appropriate [33]. Based on the type of spam, this paper reports a study on deceptive opinion spamming. We have tackled the opinion spam detection problem in the following three ways.

1) Genre identification between informative vs. creative/imaginative writing
The problem of finding deceptive opinion spam can be cast as the genre identification task of deciding whether a text is imaginative or informative writing. Imaginative writing is quite different from informative writing: it relies heavily on imagination and the motive behind it, and it includes the representation of ideas, feelings and mental images in words. People behave differently when they have to write about something they have not experienced; for example, when you imagine something rather than experiencing it, you tend to be more negative and tense. Informative writing, in contrast, comprises mostly truth, facts, and experience, and primarily provides information through explanation, description, argument and analysis. Imaginative writing might use metaphor to translate ideas and feelings into a form that can be communicated effectively. We can easily relate the imaginative writer to the deceptive reviewer, who leaves clues such as more sense-based words, fewer facts, etc.
Psychological features can play a vital part in distinguishing between deceptive and truthful reviews. People lie most frequently about their feelings and preferences, but less often about their experiences, actions and plans, and the lie is clearly visible in their writing when they write a false review about their experience of a product or service. Studies also suggest that lexical diversity is genre-sensitive [17]. As discussed earlier, vocabulary richness is expected to be higher in informative reviews because of the originality of their content; on the other hand, when someone tries to write about something she/he has not experienced, they might borrow words and use them repetitively, which leads to low lexical diversity. That is why we have used the LIWC_psy and LEX feature sets to train our classifiers for the genre identification problem.

2) Linguistic deception detection
This whole problem can also be treated as linguistic deception detection, which focuses on how effectively linguistic features alone can detect deception. Studies suggest that to the extent that liars deliberately try to control their feelings, expressive behavior, and thoughts, the higher the chances that their performance will be compromised [2]: they will seem less forthcoming, less convincing, less pleasant and more tense. A deceptive spammer leaves various linguistic cues when lying about something. To obtain linguistic deception cues we have used the LIWC_ling feature set, which subsumes most of the linguistic features used in previous research. Apart from these, we have also used the READ feature set. With both of these feature sets, we have developed our linguistic classifiers for this approach.

3) Traditional text classification problem
In the most traditional formulation, this problem can be framed as a text classification problem using various feature sets. We trained classifiers on all possible combinations of our feature sets; rather than reporting all classifiers, we list only the top-performing ones.

B. Classifiers
This section describes the various machine learning approaches used in this work. For the given set of features, we have trained SVM, stabilized linear discriminant analysis (SLDA), random forest (RF), decision tree (DT), neural network (NN), maximum entropy (ME), bagging and boosting classifiers for all three approaches mentioned earlier.
Of all these classifiers, SVM, SLDA, RF, bagging and boosting performed better than the rest.
SVM [25] is one of the most powerful techniques for non-linear classification and has performed well in related work [5]. It tries to find the optimal separating hyperplane between the classes, using kernel methods to map the data into a higher-dimensional space through a non-linear mapping. We have used the C++ implementation by Chih-Chung Chang and Chih-Jen Lin with C-classification and an RBF kernel; data are scaled internally to zero mean and unit variance for better class prediction. SLDA is linear discriminant analysis based on left-spherically distributed linear scores. We have used the implementation of LDA for q-dimensional linear scores of the original p predictors derived from the PCq rule [18].
Apart from SVM and SLDA, we have also focused on the ensemble methods bagging, boosting and random forest. Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. Significant improvements in classification accuracy have resulted from growing an ensemble of trees and letting them vote for the most popular class; the generalization error of a forest of tree classifiers depends on the strength of the individual trees and the correlation between them [26]. Bagging combines multiple classification models, or the same model trained on different learning sets, and the final classification is the most often predicted class across these voting classifiers. Boosting also combines the results of multiple classifiers, but it derives weights to combine the predictions of those models into a single prediction or predicted classification. In both bagging and boosting, we have used a decision tree as the individual classifier.
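The bagging scheme can be sketched as follows. This is a toy illustration that uses a one-level decision stump as the base learner, rather than the full decision trees used in our experiments:

```python
import random
from collections import Counter

def train_stump(X, y):
    """Fit a one-level decision tree: the best (feature, threshold, polarity)."""
    best_acc, best_rule = -1.0, None
    for f in range(len(X[0])):
        for t in {row[f] for row in X}:
            for pol in (0, 1):
                preds = [pol if row[f] > t else 1 - pol for row in X]
                acc = sum(p == label for p, label in zip(preds, y)) / len(y)
                if acc > best_acc:
                    best_acc, best_rule = acc, (f, t, pol)
    f, t, pol = best_rule
    return lambda row: pol if row[f] > t else 1 - pol

def bagging(X, y, n_models=11, seed=0):
    """Train stumps on bootstrap resamples; predict by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        models.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda row: Counter(m(row) for m in models).most_common(1)[0][0]
```

Because each base learner sees a different bootstrap resample, the vote averages out the variance of the individual learners; boosting differs in that later learners are weighted toward examples the earlier ones misclassified.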

C. Dataset
As mentioned earlier, we have used the publicly available gold-standard deceptive opinion spam corpus for our experiments [3]. This dataset was generated through crowdsourcing and domain experts. To construct it, the authors mined truthful reviews of 20 hotels near Chicago from TripAdvisor, following the work of Yoo and Gretzel [27], while to solicit deceptive reviews they used anonymous online workers (known as Turkers). The Turkers were told to assume the role of an employee in the marketing department of the company, and were paid one dollar to write a fake review for the hotel/restaurant. The earlier version of the dataset contains reviews from the hotel domain only (400 truthful, 400 deceptive). The current version contains reviews from the restaurant domain (200 truthful, 200 deceptive) and the hotel domain (800 truthful, 800 deceptive). We have performed our experiments on the current dataset for both domains; the baseline results are reported on the earlier version of the dataset for the hotel domain.

V. EXPERIMENTS AND RESULTS
This work is an extension of Myle Ott's work on finding deceptive opinion spam, in which NG and psycholinguistic features were used to achieve the best accuracy with SVM. We have used this accuracy as the baseline for our experiments, as shown in Table 9. They used psycholinguistic features extracted from an earlier version of LIWC (version 2007), which we refer to as LIWC_old in this paper, and performed their experiments on the earlier version of the dataset for the hotel domain only. To build our classifiers, we extracted 92 text dimensions as text features from LIWC, twelve metrics of lexical diversity and eight metrics of readability, along with unigrams and bigrams, using R packages. We have used some standard feature selection techniques to avoid overfitting, improve accuracy, and reduce training time; moreover, including redundant features can sometimes mislead the modelling algorithm. We have used Weka [] as the feature selection tool and tried every attribute evaluation method available in Weka to select the best features; chi-square, information gain, and gain ratio outperformed the others. R software is used for the simulation.
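For a binary feature against our binary class, the chi-square attribute score reduces to the familiar statistic on a 2x2 contingency table. A minimal sketch follows; Weka's implementation additionally handles multi-valued attributes and discretization of numeric ones:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for one binary feature against a binary class.
    n11: deceptive reviews containing the feature, n10: deceptive without it,
    n01: truthful with it, n00: truthful without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    return num / den if den else 0.0
```

Features are ranked by this score and the lowest-ranked ones dropped; a score of zero means the feature is distributed identically across the two classes and carries no discriminative information.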
To check the effectiveness of the feature sets, we trained classification models for each of them individually. Tables 7 and 8 show how the different feature sets performed individually with different learning methodologies for the hotel and restaurant domains respectively. In terms of feature sets, we find the psychological processes most effective at differentiating between truthful and deceptive opinions. The newer version of the LIWC feature set (LIWC_all) gives better results than the older version (LIWC_old); the reason for the improved performance is the inclusion of new text dimensions such as tone, authenticity and informality. Apart from that, the LIWC_all feature set also performs better than LEX and READ. The differences in these psychological processes support Zuckerman's claim that psychological processes are likely to occur more or less often when people are lying compared with when they are telling the truth [28]. To understand this result better, we have to look at what LIWC_all subsumes: it determines the degree to which any text uses positive or negative emotions, self-references, causal words, and 80 other language dimensions. We have also observed differences in word and sentence counts, which are also included in this feature set; this supports Vrij's claim that liars offer fewer details to allow fewer opportunities to be disproved [29]. Even though LIWC_all shows good classification accuracy compared to LEX and READ, it has more features than both, and moreover all of these feature sets can work as complements to each other. The statistics in Tables 1, 2 and 4, 5 show a clear difference in readability and lexical diversity between deceptive and truthful reviews; the two-tailed t-tests easily rejected the null hypotheses, showing significant differences. The classification accuracies on these feature sets also support both hypotheses we stated earlier about readability and lexical diversity.
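All accuracies reported in this section are 10-fold cross-validation estimates. The protocol can be sketched generically as follows, where `train_fn` is a placeholder for any of the learners above:

```python
def k_fold_accuracy(X, y, train_fn, k=10):
    """Average held-out accuracy over k folds.
    train_fn(X_train, y_train) must return a predict(x) function."""
    accs = []
    for fold in range(k):
        held_out = set(range(fold, len(X), k))   # every k-th example held out
        X_tr = [x for i, x in enumerate(X) if i not in held_out]
        y_tr = [v for i, v in enumerate(y) if i not in held_out]
        predict = train_fn(X_tr, y_tr)
        hits = [predict(X[i]) == y[i] for i in sorted(held_out)]
        accs.append(sum(hits) / len(hits))
    return sum(accs) / len(accs)
```

Every example is tested exactly once on a model that never saw it during training, so the averaged accuracy is a far less optimistic estimate than training-set accuracy.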
Tables 9 and 10 show the results of all three strategies, along with their feature sets, for all learning models in the hotel and restaurant domains respectively. All learning models trained only on n-grams performed comparatively better than those trained on the LEX, READ and LIWC_all feature sets, which shows that the context of the documents needs to be considered and that the other feature sets work as complements to improve the accuracy further. Among the classifiers, the ensemble methods (mainly bagging and boosting) outperformed the others on most occasions. Figures 2 and 3 show the best performance for each classifier under 10-fold cross-validation using the NG, LIWC_all, READ and LEX feature sets for the hotel and restaurant domains respectively. Treating deceptive spam detection as a genre identification task, using only the genre-sensitive feature sets LIWC_psy and LEX, we achieved accuracy up to 83% for hotels and 81% for restaurants; in previous work [3], part of speech (POS) was used as the genre identification feature, achieving up to 73% accuracy for hotels. Treating the problem as linguistic deception detection and using only linguistic features, we achieved up to 80% for the hotel domain and 79% for the restaurant domain. Tables 11 and 12 show micro precision, recall, and F-score for the best-performing method of each strategy on its respective feature set. In our experiments, we noticed that in most cases there is no significant difference in accuracy between RF and SVM, and an advantage of random forest over bagging and boosting is that it is faster and relatively robust to outliers and noise; apart from that, it gives an internal estimate of correlation and of feature importance, shown in Table 13. In this study, we also contrasted our results with some of the findings of previous research.
For example, across studies it has been found that deceptive statements are moderately descriptive and distanced from the self compared with truthful ones [30]. In the case of deceptive reviews, we found lower total word and sentence counts but more self-referencing. Deceptive reviews are less descriptive, and the reason may be fear of being caught. It has also been observed that, to make an impact, spammers go either extremely positive or extremely negative; a clear difference is seen in the negative and positive feature values between the two types of reviews.
Using various linguistic measures, researchers have found that individuals instructed to be deceptive show less lexical diversity and complexity than truthful individuals [31]. Our study supports both claims: we found fewer exclusion words, a marker of complexity, in deceptive reviews, as well as less diversity, because in the absence of real experience spammers borrow experience from other reviewers.

VI. CONCLUDING REMARKS
It is widely accepted that deceptive spam is difficult to detect manually. In this work, we have trained automated classifiers with high accuracy using domain-independent features, and have explored the relationship between deceptive opinions and linguistic characteristics such as readability and lexical diversity. This work has shown different ways to formulate the problem of deceptive spam detection and effective strategies to solve them, with detailed experiments and analysis for various machine learning algorithms. This paper makes several theoretical contributions, contrasting some assumptions about deception and strengthening many others.
Spammers are getting smarter every day, which is why, in future work, both domain-specific and domain-independent deceptive clues need to be discovered. One possible future direction is to evaluate these deception clues in other domains.