1 Introduction

Nowadays, users undertake a variety of online activities, such as buying and selling items, disseminating ideas via blogs, and exchanging information in general. Such information is not always reliable: some people use the Internet to transmit information with the purpose of manipulating and deceiving other users. For instance, when a user wants to buy an item online, the main way for them to judge whether the product is good is to read the opinions about it posted on the seller's Web page. Such opinions have been shown to have a strong impact on the final decision of whether to acquire the item. Because of this, some sellers hire people to write positive opinions in order to increase the sales of a product, even if those people have no real knowledge of the quality of the item. In other cases, deceptive opinions aim to discredit products offered by competitors.

Apart from deceptive texts written to manipulate users' buying decisions, there are also deceptive texts that aim to change people's opinion or viewpoint about a certain subject, such as a political candidate or an issue of public debate.

This makes the study of detecting deceptive texts very important. The task can be defined as identifying those written opinions in which the author aims to transmit information that he or she does not believe in (Keila and Skillicorn 2005). Studies on deceptive texts have empirically proven that truthful communication is qualitatively different from deceptive communication (Ekman 1989; Twitchell et al. 2004). Because of this, a number of projects have been launched with the aim of identifying deceptive texts as accurately as possible. For this, various datasets have been created. Such datasets consist of texts labeled as truthful or deceptive. In a machine-learning approach, a part of the texts in the dataset is used as the training set for a classifier and the remainder as a test set. Thus, a direct comparison between different classifiers and feature selection methods is possible by applying them to the same dataset.

In this work, we address two research questions. First, we assess the suitability of support vector networks (SVN) for the task of classifying deceptive texts as accurately as possible. Second, given that features based on LDA showed good performance when evaluated on each dataset separately (see Sect. 4.1), we explore whether a feature set can be general enough to classify a dataset on a topic different from that of the training dataset, which would allow creating domain-independent, general-purpose deceptive text detectors.

For this, we generated features using various methods, namely latent Dirichlet allocation (LDA), the linguistic inquiry and word count (LIWC) tool, and a word-space model (WSM), as well as combinations of features generated with these methods. To evaluate each method, we use three datasets on different topics: OpSpam, which consists of opinions about hotels; DeRev, which consists of opinions about books bought on Amazon; and the Controversial Topics dataset, which is composed of opinions on three topics (abortion, death penalty, and best friend). Using these datasets, we investigate which method performs best in a single-domain setting, where the training and test sets are on the same topic; in a mixed-domain setting, where both training and test sets are on a mixture of topics; and in a cross-domain setting, where the training and test sets are on different topics.

With these experiments, we evaluate the possibility of using existing datasets to detect deceptive texts on a topic for which there is no dataset available, that is, the possibility of developing a general-purpose, domain-independent deceptive text detector.

The paper is organized as follows. In Sect. 2, we discuss state-of-the-art approaches to the detection of deceptive texts. In Sect. 3, we describe our classification model (Sect. 3.1), the deception detection datasets we used (Sect. 3.2), and the feature sources (Sect. 3.3). In Sect. 4, we describe our experimental setup and discuss our results. Finally, we draw our conclusions in Sect. 5.

2 Related work

To test the performance of models for detecting deception in text, several labeled datasets have been developed in different ways. Gokhman et al. (2012) distinguished two general ways of developing such datasets: sanctioned and unsanctioned deception. In the first, participants are asked to lie; in the second, participants lie of their own accord. These datasets allow a direct comparison of the models' performance.

To build models that detect deception, different sources of features are used; these sources are sometimes combined to achieve a more accurate classification. Bag-of-words (BoW) is a common approach to generating features for representing documents; it disregards grammar and even word order but keeps the count of occurrences of each word. The BoW approach includes single words and n-grams. On the other hand, features based on writing style are also used; unlike the BoW approach, linguistic style takes the context of the words into account. Additionally, certain general deception cues are sought for detecting deception (DePaulo et al. 2003), for example, the use of unique words, self-references, and modifiers, among others.

Techniques similar to BoW are based on different kinds of elements extracted from text, such as words, syllables, phonemes, letters, etc. In this vein, Hernández Fusilier et al. (2015) compared word n-grams with letter n-grams; the latter were shown to yield better performance on the classified dataset. However, even though n-grams alone give acceptable results, they are usually combined with other NLP techniques; such combined feature sets often improve the results.

Another method for representing documents is to use handcrafted dictionaries. Newman et al. (2003), for example, by analyzing LIWC’s word categories, found that liars use fewer self-references and use more negative emotion words. This work laid the foundation for the LIWC tool to be widely used by other researchers (Schelleman-Offermans and Merckelbach 2010; Toma and Hancock 2012).

For instance, Mihalcea and Strapparava (2009) used the LIWC tool to discover dominant classes of deceptive texts. The authors classified a corpus of deceptive and truthful texts on controversial topics such as abortion, death penalty, and a best friend. In a similar study, Pérez-Rosas and Mihalcea (2014a) attempted to classify texts on the same topics but in different languages, such as Spanish texts written by native speakers from Mexico, English texts written by speakers from the USA, and English texts written by speakers from India.

Following the work by Mihalcea and Strapparava, Almela et al. (2012) conducted a study to detect deceptive texts written in Spanish. The authors collected a new dataset on the topics of homosexual adoption, bullfighting, and feelings about a best friend. One hundred deceptive and one hundred truthful documents were collected for each topic, with 80 words per document on average. Distinct LIWC dimensions were used to achieve a more accurate classification with a support vector machine (SVM).

Deception detection has also been applied in specific settings. Williams et al. (2014) compared lies told by children and lies told by adults, aiming to detect deception in courts where children testified. To generate the dataset, 48 children and 28 adults were chosen; half of the children and adults told lies and half told the truth. The authors then used the LIWC tool to generate samples for classification. The findings showed significant differences between truthful and deceptive texts, mainly involving linguistic variables such as singular self-references (e.g., I, my, me), plural self-references (e.g., we, our, us), and positive and negative emotions. In addition, the results showed that these linguistic variables appeared in different proportions depending on whether the lie was told by a child or by an adult.

Hauch et al. (2012) analyzed several works on deceptive text identification. Most of them are based on documents processed by computer programs; more specifically, documents were mainly represented using the LIWC tool. The findings showed that liars use certain linguistic categories at a different rate than truth-tellers.

BoW and dictionary methods have shown good performance; however, in the effort to improve results, the context of words has also been taken into account, for example, by analyzing the syntactic relations between the words using dependency trees (Feng et al. 2012; Xu and Zhao 2012). In general, the use of syntactic relations has not shown outstanding performance in the task of classifying deceptive text. However, combining this method with a BoW approach can improve results.

In some cases, not only information about the words or syntactic structures of the text is available, but also additional information supplied by the source from which the texts were extracted. Fornaciari and Poesio (2014) collected fake and real opinions from the Amazon Web site. The authors took into account, for example, information on who bought the book in question and who did not: people who bought the book and wrote their opinion had higher credibility than people who did not buy the book. This kind of extra information is rarely available, so we did not focus on such features in the present study.

Pérez-Rosas and Mihalcea (2014b) collected features using different approaches, such as part-of-speech (PoS) tags, context-free grammars (CFG), unigrams, and LIWC, as well as combinations of these features. The authors achieved accuracies between 60 and 70% when predicting whether a deceptive text had been written by a female or a male author. The findings showed that the use of PoS tags and CFG did not significantly improve accuracy compared with unigrams and LIWC. This suggests that BoW- and dictionary-based approaches give performance similar to that of linguistic style approaches.

Combining different methods is a common strategy for building accurate deception detection models; it exploits the advantages of each approach and thus improves classification accuracy. However, to our knowledge, building a universal, domain-independent deception detector has been scarcely studied.

3 Deception detection

In this section we present our method for deception detection. First, in Sect. 3.1 we present our model for deception detection using support vector networks. Next, in Sect. 3.2 we describe the different datasets we will use for evaluation. Finally, in Sect. 3.3 we detail the different feature sources we will use.

3.1 Support vector networks

Support vector networks have been widely applied to solve tasks in different areas (Petković et al. 2014a, b; Shamshirband et al. 2014; Altameem et al. 2015; Gani et al. 2016; Kisi et al. 2015; Mohammadi et al. 2015a, b; Olatomiwa et al. 2015a, b; Piri et al. 2015; Protić et al. 2015; Shamshirband et al. 2015a, b; Al-Shammari et al. 2016a, b; Gocic et al. 2016; Jović et al. 2016a, b; Shamshirband et al. 2016a, b; Shenify et al. 2016).

To generate a model that more closely represents deception, we propose using support vector networks with attribute selection (Guyon and Elisseeff 2003). The latter is important for removing redundant and irrelevant features.

In the process of feature selection, we used the WEKA tool (Hall et al. 2009) with the correlation-based evaluator proposed by Hall (1999). Furthermore, we chose a search criterion based on hill climbing with backtracking. This combination showed a significant increase in accuracy. We also found that binarizing the feature vectors gave a more accurate model for deceptive text detection in the present experimental setup. The same attribute selection process, explained above, was conducted in all our experiments.
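The attribute selection itself was performed inside WEKA. Purely as a rough, self-contained illustration of the idea, the following Python sketch implements the correlation-based merit of Hall (1999) with a simple greedy hill-climbing search; the function names (cfs_merit, greedy_forward_selection) and the stopping rule are our own simplifications and, in particular, omit WEKA's backtracking step.

```python
import numpy as np

def _abs_corr(a, b):
    """Absolute Pearson correlation; constant vectors are treated as 0."""
    c = np.corrcoef(a, b)[0, 1]
    return 0.0 if np.isnan(c) else abs(c)

def cfs_merit(X, y, subset):
    """Correlation-based merit of a feature subset (Hall 1999): favors
    features highly correlated with the class but weakly correlated
    with each other."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = np.mean([_abs_corr(X[:, j], y) for j in subset])
    if k == 1:
        return r_cf
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = np.mean([_abs_corr(X[:, a], X[:, b]) for a, b in pairs])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_forward_selection(X, y, max_features=100):
    """Hill-climbing search: repeatedly add the single feature that most
    improves the merit; stop when no addition helps (no backtracking)."""
    selected, best = [], 0.0
    while len(selected) < max_features:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if not candidates:
            break
        score, j = max((cfs_merit(X, y, selected + [j]), j) for j in candidates)
        if score <= best:
            break
        selected.append(j)
        best = score
    return selected
```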

The procedure for generating the feature vectors is as follows:

  • To form feature vectors using a word-space model (WSM), we first obtained lemmas and kept stop words. Second, a list of all distinct words (types) found in the text collection (both deceptive and truthful texts) was generated. Next, given a document and the list of types, if the current word of the type list was contained in the document, the corresponding feature value was set to one; otherwise, it was set to zero.

  • LDA yields feature vectors with real-valued entries (probabilities of belonging to topics). We therefore converted the feature values into binary values. To that end, a threshold was calculated by dividing the sum of all membership probabilities by the number of topics; each probability equal to or greater than the threshold was converted into one, otherwise into zero (see the sketch after this list).

  • LIWC generates vectors of 64 features. Each vector was obtained as follows: given a document and the 64 categories, if some word of the current category was found in the document, the corresponding feature was set to one; otherwise, it was set to zero.
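As an illustration of the binarization step for the LDA features described above, the following minimal Python sketch applies the threshold (the sum of the membership probabilities divided by the number of topics, i.e., their mean); the function name is ours.

```python
import numpy as np

def binarize_lda_vector(topic_probs):
    """Binarize an LDA feature vector: the threshold is the sum of all
    topic-membership probabilities divided by the number of topics
    (i.e., their mean); probabilities >= threshold become 1, else 0."""
    topic_probs = np.asarray(topic_probs, dtype=float)
    threshold = topic_probs.sum() / len(topic_probs)
    return (topic_probs >= threshold).astype(int)

# Example: a document described by 5 topic probabilities
print(binarize_lda_vector([0.02, 0.55, 0.10, 0.30, 0.03]))  # -> [0 1 0 1 0]
```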

Details on the WSM, LDA and LIWC will be given in Sect. 3.3. Classification was conducted by using a support vector network with fivefold cross-validation. We experimented with the following kernels:

  • linear: \(K(\mathbf{x}_{i}, \mathbf{x}_{j})=\mathbf{x}_{i}^{T}\mathbf{x}_{j}\).

  • polynomial: \(K(\mathbf{x}_{i}, \mathbf{x}_{j})=(\gamma \mathbf{x}_{i}^{T}\mathbf{x}_{j}+r)^{d}\), \(\gamma > 0\).

  • radial basis function (RBF): \(K(\mathbf{x}_{i}, \mathbf{x}_{j})=\exp(-\gamma \| \mathbf{x}_{i}-\mathbf{x}_{j}\| ^{2})\), \(\gamma > 0\).

  • sigmoid: \(K(\mathbf{x}_{i}, \mathbf{x}_{j})=\tanh(\gamma \mathbf{x}_{i}^{T}\mathbf{x}_{j}+r)\).

We obtained the best results (values between 59.8 and 76.3%, measured on the mixed-domain corpus) with the linear kernel. With the polynomial kernel, we obtained an accuracy of 50% on all corpora; with the radial basis function kernel, values between 49.5 and 52.5%; and with the sigmoid kernel, values between 49.7 and 51.2%. Since these values are near 50%, which corresponds to a random baseline, hereafter all results are reported for an SVN with the linear kernel.
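The sketch below shows how such a kernel comparison can be reproduced with scikit-learn and fivefold cross-validation. It is only an illustration of the setup, not our exact experimental pipeline; the random binary matrix stands in for the real binarized feature vectors.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X: binary document-feature matrix, y: 1 = deceptive, 0 = truthful.
# Toy random data here only to make the sketch self-contained.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 300))
y = rng.integers(0, 2, size=200)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel)
    acc = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
    print(f"{kernel:8s} mean accuracy: {acc:.3f}")
```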

3.2 Datasets used

We experimented with three datasets: the DeRev corpus, the OpSpam corpus, and a corpus of opinions about three controversial topics. The authors of the corpora used two traditional methods to collect deceptive and truthful texts: sanctioned and unsanctioned deception (Gokhman et al. 2012).

DeRev dataset (DEception in REViews) (Fornaciari and Poesio 2014) is a corpus composed of deceptive and truthful opinions obtained from the Amazon Web site. This corpus includes opinions about books. This gold-standard corpus contains 236 texts, of which 118 are truthful and 118 are deceptive.

Confidence that the deceptive texts were collected correctly is based on two public communications, one by Sandra Parker and one by David Streitfeld. Parker claimed that she had received payment for writing opinions about 22 books. Streitfeld reported on four books whose authors admitted that reviews had been paid for. DeRev's authors analyzed these communications and focused on twenty writers of fake opinions, which resulted in a corpus of 96 deceptive opinions.

To obtain the 118 truthful texts, DeRev's authors took certain aspects into account to ensure a high probability that the selection was correct, by making sure that the texts did not exhibit any cue of deception. These cues of deception refer mainly to such aspects as whether opinions were written by users who used their real name and whether opinions were written by users who actually bought the book in question from Amazon, among others.

In this dataset, the deceptive and truthful texts were not obtained in a deliberate manner, i.e., the participants were not asked to write lies; instead, the texts were obtained after the participants had lied. Thus, this is a corpus of unsanctioned deception.

A sample of deceptive text

I definitely fell in love with this book! I am a huge poetry fanatic. In all honesty, my initial thoughts of the title of this book were an instant reminder of the movie Final Destination, but of course, I was not expecting the book to be a depiction of the movie. In opposition, The Final Destination is a book filled with poetry of personal inner thoughts, struggles, and even pain. As a poet as well, it can be quite hard to explain your most intimate thoughts throughout a poetry piece. I can relate to every poem because of the situations of unanswered prayers, feeling blind throughout life, happiness, loneness, failure, not fitting in society, and so much more. Excellent book!

A sample of truthful text

This is an enjoyable account of a man who wanted to live a time of solitude, so he built a cabin and lived by himself, thinking and writing down his thoughts. This is a good account of 19th century life, close to nature. This is probably one of those books every well-educated American should read. This was my first kindle purchase. I have been meaning to read “Walden” for years now and never got around to reading it until I obtained my kindle. First of all, I love the kindle for the variety of classic literature that is available. I do not live close to a public library, so having books delivered to my kindle is great!

OpSpam dataset (Opinion SPAM) (Ott et al. 2011) is a corpus composed of fake and genuine opinions about different hotels. Its deceptive part was collected through Amazon Mechanical Turk and its truthful part from TripAdvisor, as described below.

OpSpam’s authors used Amazon Mechanical Turk (AMT) to generate the deceptive opinions. Each participant was given the name of a hotel and its Web site; this information allowed them to write a review of the hotel. The authors asked the participants to imagine that they worked at a hotel and that the administrator had asked them to write an opinion about it as if they were guests. The opinion was to appear real and highlight the positive aspects of the hotel. OpSpam’s authors limited submissions to one per participant, so that the same person could not write more than one opinion. Additionally, participation was restricted to workers who lived in the USA and had an AMT approval rating of at least 90%. The participants had a maximum of thirty minutes to write the opinion and were paid one dollar per accepted opinion. In this way, 400 deceptive texts were collected.

The truthful opinions, in turn, were collected from TripAdvisor. First, 6977 opinions on the twenty most popular hotels were extracted. The authors then eliminated 3130 opinions that did not have five stars, 41 that were not written in English, 75 that had fewer than 150 characters, and 1607 that were written by first-time TripAdvisor reviewers. Eventually, 400 of the remaining texts were selected.

With this, OpSpam’s authors collected a dataset composed of 800 texts in total. Since the participants were explicitly asked to write lies to obtain the deceptive texts, this is a corpus of sanctioned deception.

A sample of deceptive text

The Hyatt Regency Chicago hotel is perfectly located in the center of downtown Chicago. Whether you are going there for business or pleasure, it is in the perfect place. The rooms are large and beautiful and the ball room took my breath away. The wi-fi connection was perfect for the work I needed to do and the show at the Navy Pier was perfect for when I needed a break. Other hotels have nothing on the Hyatt. I just wish there was a Hyatt Regency in every city for all of my business trips.

A sample of truthful text

I stayed for four nights while attending a conference. The hotel is in a great spot—easy walk to Michigan Ave shopping or Rush St., but just off the busy streets. The room I had was spacious and very well appointed. The staff was friendly, and the fitness center, while not huge, was well equipped and clean. I have stayed at a number of hotels in Chicago, and this one is my favorite. Internet was not free, but at $10 for 24 hours is cheaper than most business hotels, and it worked very well.

Opinions dataset (Pérez-Rosas and Mihalcea 2014a) is a corpus composed of opinions about three controversial topics: abortion, death penalty, and a best friend. It consists of 100 deceptive texts and 100 truthful texts. The texts were collected through AMT, and the English task originating from the USA was restricted to participants who lived in that country (the dataset also contains English texts by speakers from India and Spanish texts by speakers from Mexico; however, these texts were not used in our research).

To obtain truthful texts, the authors asked participants to write a real opinion about each of the topics. In contrast, to obtain deceptive texts, the participants were asked to lie about their opinion; thus, this is a corpus of sanctioned deception.

A sample of deceptive text

Abortion is murder and people who kill others should be put to death. It goes against the teachings of the Bible and is the worst kind of sin. We should do everything to stop it, no matter the cost. People should be ashamed for even thinking about having an abortion.

A sample of truthful text

I believe that abortion in some cases is positive thing. Of course, in the case of rape or if the baby would be deformed or have no life quality it is acceptable. I do not believe abortion should be used as a form of birth control, meaning that every “mistaken” pregnancy can be dealt with an abortion. In some cases, extreme financial distress, perhaps it can be an alternative.

3.3 Sources of text features

We focused on three feature sources: two types of features usually found in work on text deception identification, namely a word-space model (WSM) and the LIWC dictionary, and the semantic continuous space model (LDA).

Unigrams have been used in several works (see Sect. 2) as a basis for combination with other kinds of features, because unigram-based features are very informative in the task of deception detection. In addition, these features have commonly been combined with LIWC to capture behavioral information latent in documents. The main drawback, however, is that LIWC is a handcrafted resource; thus, a separate LIWC dictionary is needed for each language to be analyzed.

We use a binary word-space model instead of unigram counts because both methods showed similar performance even though the former is simpler. Furthermore, we use LDA to add semantic information to the model. Unlike LIWC, LDA analyzes documents statistically to extract features regardless of the language in question.

Word-space model (WSM) Several previous works have shown that words are very important and relevant features for the task of identifying deceptive texts. For this reason, we decided to analyze the performance of features based on a matrix of words, which represents a word-space model.

To generate these features, we formed a list of all words \(W_{1}, W_{2}, \ldots , W_{n}\) in the dataset. Then, we analyzed each document by checking whether each word \(W_{i}\) exists in the current text, in which case the corresponding feature \(F_{i}\) was set to 1; otherwise, it was set to 0. Figure 1 shows how the feature vectors are represented.

Fig. 1 This example shows how vectors of features are formed by using a binary word-space model
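A minimal sketch of the binary word-space representation illustrated in Fig. 1, assuming whitespace-tokenized, already-lemmatized documents (the function name binary_wsm is ours):

```python
def binary_wsm(documents):
    """Build binary word-space vectors: one feature per vocabulary word,
    set to 1 if the word occurs in the document and 0 otherwise."""
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    vectors = [[1 if word in set(doc.split()) else 0 for word in vocabulary]
               for doc in documents]
    return vocabulary, vectors

docs = ["the hotel was perfect", "the book was boring"]
vocab, vecs = binary_wsm(docs)
print(vocab)  # ['book', 'boring', 'hotel', 'perfect', 'the', 'was']
print(vecs)   # [[0, 0, 1, 1, 1, 1], [1, 1, 0, 0, 1, 1]]
```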

Linguistic inquiry and word count (LIWC) is a word counting tool (Pennebaker et al. 2007). It is based on manually labeled groups of words. LIWC classifies words into emotional, cognitive, and structural component categories. It was developed for studying psycholinguistic concerns dealing with the therapeutic effect of verbally expressing emotional experiences and memories. LIWC provides an English dictionary composed of nearly 4500 words and word stems. Each word can be classified into one or more of 64 categories, which in turn fall into four groups: linguistic processes (pronouns, articles, prepositions, numbers, negations), psychological processes (affective words, positive and negative emotions, cognitive processes, perceptual processes), relativity (time, space, motion), and personal concerns (occupation, leisure activity, money/financial issues, religion, death and dying).

We used the 2007 version of LIWC with the English dictionary as of 04/11/2013. Figure 2 shows an example of some groups of words found in LIWC.

Fig. 2 Example of some groups of words with their corresponding labels contained in LIWC (we show just a few of the words per group that can be found in LIWC)
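Since the LIWC 2007 dictionary is a proprietary resource, the sketch below uses a small, invented mini-dictionary only to illustrate how the binary category features of Sect. 3.1 are computed; the category names and word lists shown here are hypothetical and much smaller than the real 64 categories.

```python
# Hypothetical mini-dictionary; the real LIWC 2007 resource has roughly
# 4500 entries spread over 64 categories.
CATEGORIES = {
    "self_references":  {"i", "me", "my", "mine"},
    "negations":        {"no", "not", "never"},
    "positive_emotion": {"love", "great", "excellent"},
    "negative_emotion": {"hate", "pain", "awful"},
}

def liwc_binary_vector(document):
    """One binary feature per category: 1 if any word of the category
    occurs in the document, 0 otherwise (as described in Sect. 3.1)."""
    words = set(document.lower().split())
    return [1 if words & lexicon else 0 for lexicon in CATEGORIES.values()]

print(liwc_binary_vector("I love this book no pain at all"))  # [1, 1, 1, 1]
```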

Fig. 3 Example of topics generated using LDA on texts about the death penalty

Latent Dirichlet allocation (LDA) (Blei et al. 2003) is a probabilistic generative model for collections of discrete data such as text collections. It represents documents as a mixture of different topics, where each topic consists of a set of related words chosen with certain probabilities. The model assumes that each document is formed word by word by randomly selecting a topic and then a word from that topic; as a result, each document can combine different topics. Simplifying things somewhat, the generative process assumed by LDA consists of the following steps:

  1. Determine the number N of words in the document according to a Poisson distribution.

  2. Choose a mixture of topics for the document according to a Dirichlet distribution, out of a fixed set of K topics.

  3. Generate each word in the document as follows:

     (a) choose a topic;

     (b) choose a word from this topic.

Assuming this generative model, LDA analyzes the set of documents to reverse-engineer this process, finding the most likely set of topics of which each document may consist. Unlike LIWC, LDA generates the groups of words (topics) automatically; see Fig. 3.

Accordingly, given a fixed number of topics, LDA can infer how likely it is that each topic (set of words) appears in a specific document of a collection. For example, with 500 latent topics generated by the LDA algorithm, each document would have its own distribution over these 500 topics; that is, vectors of 500 features would be created.
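A hedged sketch of extracting such per-document topic distributions with the gensim implementation of LDA; we do not claim this is the exact toolkit or preprocessing used in our experiments, and the toy documents and small number of topics are for illustration only (the actual experiments use 600 topics, as explained below).

```python
from gensim import corpora, models

# Toy tokenized documents; in our experiments each corpus is lemmatized first.
texts = [["death", "penalty", "justice", "crime"],
         ["hotel", "room", "staff", "clean"],
         ["book", "poetry", "author", "read"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(corpus=bow_corpus, id2word=dictionary,
                      num_topics=5, passes=10, random_state=0)

# Per-document topic-membership probabilities -> one feature per topic
for bow in bow_corpus:
    dist = lda.get_document_topics(bow, minimum_probability=0.0)
    print([round(p, 3) for _, p in dist])
```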

Table 1 Accuracy comparison with regard to number of established topics
Table 2 Statistical significance

LDA requires the number of topics to be specified in advance; any change in this parameter may change the classification accuracy, so it is necessary to find an appropriate value. To find the number of topics that allows an optimal classification, we tested different values for the LDA\(\,+\,\)WSM features on each corpus and calculated the average accuracy. The results of these experiments are shown in Table 1, which compares the number of topics against the obtained accuracy. It can be seen that, as the number of topics increases, an optimal point is reached (600 topics) beyond which further increasing the number of topics does not improve accuracy.

All experiments hereafter involving LDA use 600 topics. Each document processed by LDA yields a vector of 600 features, each one representing the probability that the document belongs to the corresponding topic. Note that all features are converted into binary values before classification, as detailed in Sect. 3.1.

Once the feature vectors are generated using different combinations of features, we proceed to train and test our model on different corpora. This allows us to answer our motivating questions: (1) to assess the suitability of support vector networks (SVN) for classifying deceptive texts as accurately as possible, and (2) to explore whether a feature set can be general enough to classify a dataset on a topic different from that of the training dataset.
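To make this setup concrete, the following sketch combines binarized LDA vectors with WSM vectors (the LDA + WSM configuration) and trains on one group of corpora while testing on a held-out one. The helper names are ours, and the random data merely stands in for the real feature matrices.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def combine_features(wsm_vectors, lda_binary_vectors):
    """Concatenate binary WSM features with binarized LDA features."""
    return np.hstack([np.asarray(wsm_vectors), np.asarray(lda_binary_vectors)])

def cross_domain_eval(X_train, y_train, X_test, y_test):
    """Train on one (mixture of) domain(s), test on a held-out domain."""
    clf = SVC(kernel="linear")
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))

# Toy usage with random binary features (for illustration only)
rng = np.random.default_rng(1)
Xtr = combine_features(rng.integers(0, 2, (100, 40)), rng.integers(0, 2, (100, 10)))
Xte = combine_features(rng.integers(0, 2, (40, 40)), rng.integers(0, 2, (40, 10)))
ytr, yte = rng.integers(0, 2, 100), rng.integers(0, 2, 40)
print(cross_domain_eval(Xtr, ytr, Xte, yte))
```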

4 Results and discussion

In this section, we present the results obtained on different combinations of datasets using an SVN. First, we performed classification separately on each dataset, using fivefold cross-validation (for this, the testing and training parts of each individual dataset were merged together) (Table 2). This experiment performed better on most datasets (4 out of 5) than other studies (see Sect. 4.1, Table 3); therefore, we decided to find out the scope of the proposed approach. For that reason, we experimented with mixed-domain classification on a dataset obtained by merging all datasets, again with fivefold cross-validation. Finally, we experimented with cross-domain classification by using a concatenation of all-but-one corpora for training and evaluating on the remaining dataset. We experimented with different kinds of features and combinations of features: LDA, LIWC, and WSM, as described above.

Table 3 Comparison of our results with other works on the same corpora

4.1 In-domain results

First, we performed experiments on individual datasets, each devoted to a specific subject domain, with fivefold cross-validation. Figures 4, 5 and 6 show the results on the OpSpam, DeRev, and controversial topics corpora, respectively. A combination of LDA and WSM features yielded the best results in all three cases.

Fig. 4 Accuracy obtained on the OpSpam corpus with different features

Fig. 5 Accuracy obtained on the DeRev corpus with different features

Fig. 6 Accuracy obtained on the controversial topics corpus with different features

Before proceeding to other experiments, such as mixed-domain and cross-domain deception detection, we wanted to make sure that our in-domain classifiers performed on par with the state of the art. Thus, we present a comparison of our results with other works in Table 3. Except for OpSpam, we performed better than the other works currently known to us.

Table 4 Accuracy, precision (P), recall (R) and F-measure (F) obtained on the combined corpus using SVN
Table 5 Accuracy, precision (P), recall (R) and F-measure (F) obtained on the combined corpus using Naïve Bayes

Table 2 shows the statistical significance of the differences between our results and those of other authors. For comparison purposes, we set a significance level (\(\alpha\)) of 0.05 (5%), which means that statistical significance is attained if the p-value is less than \(\alpha\). At this significance level, the improvement shown by some of our results is not statistically significant compared with some existing methods, making them practically equivalent; however, our approach has the advantage that, unlike LIWC, LDA can be applied to different languages without needing a new tool for each language. Compared with the remaining works, our results present a statistically significant improvement.

4.2 Mixed-domain classification

The main aim of the experiments shown below was to investigate to what extent the SVN classifier can be used when the five datasets on different domains are combined, with fivefold cross-validation. In this setting, the training set contained subject domains (but not specific texts) that were also present in the test set. In this case, again a combination of LDA and WSM features yielded the best result; see Table 4.

4.3 Cross-domain classification

Unlike the classification combining all domains, in this experiment we selected each dataset once as the test set and used the remaining datasets as one combined training set. In this way, the subject domain of the test set was not included in the training set. Table 6 shows the results of cross-domain classification. In these experiments, unlike those presented in Sects. 4.1 and 4.2, the combination of LDA and WSM features did not consistently yield the best accuracy. We show in boldface the best accuracy obtained for each dataset. In most cases (3 out of 5), the SVN outperformed naïve Bayes (NB); however, with the relatively simple setup of a plain word-space model, NB is able to improve deception detection with features learned from other datasets.

Table 6 Accuracy obtained in cross-domain classification

5 Conclusions and future work

Our motivation was, first, to assess the appropriateness of support vector networks (SVN) for the task of classification of deceptive texts, and, second, to explore whether a feature set can be sufficiently general to be used for classifying a dataset on a topic different from the topic of the dataset used for training, which would allow creating domain-independent general-purpose deception text detectors. For the first point, we can conclude that SVNs are indeed suitable and give good performance on deceptive text classification. We obtained the best results with a linear kernel.

As to the second question, we conducted two tests. The first consisted of a mixed corpus, where the information from all training parts was merged to form a single combined corpus. For this test, we obtained an accuracy of 76.3% using an SVN with LDA \(+\) WSM features. This would be the expected performance if we had to detect deception within one of the subjects covered in the datasets, i.e., opinions about books, hotels, best friends, abortion, and death penalty.

The second test consisted in detecting deception in a domain that had not been seen at all in training. This allowed us to measure the predictive power that can be achieved when classifying a text outside the scope of the available datasets. In this test, we only slightly surpassed the random baseline for two classes (deceptive vs. truthful), with results ranging from 53.8% on the OpSpam corpus to 64% on the “best friend” corpus. In most cases, though not always, the best classification result was obtained by the SVN with LDA \(+\) WSM features. For some datasets (OpSpam and Abortion), the NB classifier with WSM features alone performed better than the SVN. On average, NB with LDA \(+\) WSM features provided the best results (55.55%).

In general, LDA-based features provide a means for generalizing deception cues in terms of semantic topics; however, this seems to yield only a slight increase in performance in terms of a more general deception detector.

As future work, we plan to investigate other features, such as syntactic style patterns, which could help to identify deception in a broader range of domains.