1 Introduction

Investor sentiment, which refers to investors' opinions about the stock market, has been a research hot spot in behavioral finance in recent years. Stock pricing models based on the hypothesis of the rational economic man infer that the price of a stock is determined by its discounted future dividends. As long as the market stays rational, investor sentiment has little influence on the stock market. However, research in behavioral finance shows that investor sentiment can induce several irrational phenomena in the stock market, and that these irrational deviations are systematic. The vast majority of the literature on investor sentiment tends to regard individual investors, rather than institutions, as sentiment traders [1–4].

To study investor sentiment and its effects on stock markets, it is necessary to identify individuals' views on the future trend of stocks. Previous studies mainly use three types of indicators to measure investor sentiment: sentiment indices from surveys, indirect indicators from historical trading data, and sentiment indices from internet information. Survey-based studies mainly use various consumer sentiment and consumer confidence indices as proxy variables for investor sentiment [5–8]. Surveys capture investors' opinions directly, but they are time-consuming and costly. Indirect indicators based on historical stock market information include historical prices, trading volume, etc. [9–15]. This approach saves time and cost, but it reflects investors' opinions only indirectly and is limited by the choice of variables and the synthesis method. Further, because the sentiment measure is derived from historical data, it lags behind the market and can hardly reflect new information. The high penetration of the internet and the rapid development of data mining technology provide a way to identify investor sentiment directly and conveniently. Thus, applying text mining technology to extract individual sentiment from stock forums has attracted much attention. Antweiler and Frank [16] used the Naive Bayes algorithm and a standard SVM to classify 1.5 million comments on Yahoo Finance based on a manually annotated training set of 1000 samples. Das and Chen [17] used several methods, such as word counting and a Bayesian classifier, to analyze comments on Yahoo Finance from July 2001 to August 2001, using 300–500 comments as a training set. Kim and Kim [18] used the Naive Bayes algorithm with 4000 comments as a training set to classify more than 32 million comments on Yahoo Finance from January 2005 to December 2005. Wu et al. [19] used an SVM on 30,000 manually labeled reviews for 3-class sentiment classification. In general, the standard SVM is widely used in financial studies to classify investor sentiment on stock forums. However, the large number of neutral views involved in investor sentiment makes classification difficult. Some previous studies directly used 2-class classification, which forces neutral views into the positive or negative class and reduces the reliability of sentiment identification. Other studies used 3-class classification to separate views into positive, negative, and neutral, but the accuracy they report is the total accuracy over all three categories, including the neutral one. In fact, the three categories are not equally important for the purpose of sentiment identification: we care more about the accuracy on positive and negative samples, and it would be better if the neutral samples merely served as an auxiliary input to the classifier. To address this problem, this paper introduces the universum support vector machine (U-SVM) and uses the neutral views as universum samples to identify investor sentiment, since universum samples belong to none of the classes and it has been shown that universum samples positioned “in between” the two classes help obtain better classification results.

The universum has attracted wide attention because it carries prior knowledge for classification. It provides additional information about the problem to be solved, yet its samples belong to none of the classes. Take a handwritten digit recognition task for example: to distinguish the digits “5” and “8”, the other eight digits can be used as universum samples. Like semi-supervised classification, universum learning involves unlabeled samples. However, unlike semi-supervised classification, a universum sample does not belong to any class, whereas an unlabeled input in semi-supervised learning belongs to one of the classes, although which class is not known in advance. Vapnik [20, 21] proposed the idea of the universum and introduced it as an algorithm for SVM. Weston et al. [22] conducted the first experiments on training SVMs with a universum and showed accuracy improvements with universum samples; they called their algorithm U-SVM. Sinz et al. [23] analyzed U-SVM algorithms and suggested that a good universum set is positioned “in between” the two classes. Cherkassky and Dai [24] and Cherkassky et al. [25] studied the effectiveness of U-SVM for high-dimensional data and found that it depends on the distribution of universum samples relative to the standard SVM decision boundary. Dhar and Cherkassky [26] extended the conclusions of Cherkassky et al. [25] to different misclassification costs. These studies on the U-SVM algorithm have established its effectiveness and the characteristics of appropriate universum samples. Besides using universum samples in the standard SVM, several other machine learning methods with a universum have been proposed [27–37], which provide further evidence of its effectiveness. As for applications, U-SVM is widely used for classification problems in which the training dataset contains additional samples belonging to none of the classes of interest. Gao et al. [38] used U-SVM to recognize translation initiation sites for protein sequence extraction. Chen and Zhang [39] conducted experiments with U-SVM on handwritten digits and human faces. Jiao et al. [40] classified tongue images using U-SVM. Hao and Zhang [41] applied U-SVM to neuroimaging-based Alzheimer's disease classification. All of these applications suggest that universum samples perform well. Because it has been shown that neutral samples located between the two classes (positive and negative) are more likely to yield better results, whether the dataset contains labeled neutral samples becomes an issue when applying the U-SVM algorithm. When no labeled neutral samples are available, some researchers instead use random averaging samples, each generated as the average of a randomly drawn pair of positive and negative training samples, as the universum.
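
When a dataset contains no labeled neutral views, the random averaging construction mentioned above can be sketched as follows. This is a minimal illustration assuming NumPy; the function and array names are hypothetical.

```python
import numpy as np


def random_averaging_universum(X_pos, X_neg, n_universum, seed=0):
    """Generate universum samples by averaging random positive/negative pairs.

    X_pos, X_neg : feature matrices of positive and negative training samples.
    Each universum sample is the mean of one randomly drawn row from each class.
    """
    rng = np.random.default_rng(seed)
    idx_pos = rng.integers(0, len(X_pos), size=n_universum)
    idx_neg = rng.integers(0, len(X_neg), size=n_universum)
    return (X_pos[idx_pos] + X_neg[idx_neg]) / 2.0
```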

The idea of the universum is to use the prior knowledge carried by additional samples. Several methods add new samples to the original training set to improve classification performance. Semi-supervised methods require the unlabeled samples to follow the same distribution as the original samples, and in noise injection the added samples are assumed to follow a different distribution; U-SVM needs no such assumption about the distribution. Moreover, because we care more about the accuracy on positive and negative samples than on neutral samples, U-SVM, which treats neutral samples as an auxiliary input to the classifier, is more suitable for our purpose than a standard 3-class SVM. Since investor sentiment involves a large number of neutral views that may disturb classification, and the U-SVM algorithm can make good use of these neutral samples as a universum to improve classification, this paper uses a support vector machine with universum samples to classify the posts on a stock forum. We define bullish views as positive samples, bearish views as negative samples, and neutral views as universum samples. We compare the classification accuracy of U-SVM with that of the standard SVM. We then extend the discussion to a 3-class problem in order to identify neutral views in out-of-sample prediction for financial studies.

2 Background

2.1 Support vector machine

It is widely acknowledged that support vector machines (SVMs), introduced by Vapnik et al. in the 1990s [20, 42, 43], are powerful classification methods, and they are widely used in a variety of fields [44–48]. Their mathematical representation, geometric interpretation, generalization ability, and empirical performance make SVMs useful in a large number of classification applications [49]. The goal of an SVM is to learn an appropriate decision function that classifies new samples after training on a labeled dataset.

Consider a classification problem in n-dimensional space with l training samples. The training samples can be defined as:

$$ T = \left\{ {\left( {x_{1} ,y_{1} } \right),\left( {x_{2} ,y_{2} } \right), \ldots ,\left( {x_{l} ,y_{l} } \right)} \right\}, $$
(1)

where \( x_{i} \in R^{n} ,i = 1, \ldots ,l \), and for a binary classification problem, \( y_{i} \in \left\{ {1, - 1} \right\} \). The goal is to determine which class (1 or −1) a new sample x belongs to after training on the dataset T. To this end, a decision function f(x) is needed to separate the \( R^{n} \) space into two regions:

$$ f(x) = \text{sgn} (g(x)), $$
(2)

where g(x) is a real-valued function that determines the value of y for each x. In particular, for a linearly separable problem, g(x) can be a linear function:

$$ g(x) = w \cdot x + b, $$
(3)

and the corresponding hyperplane is \( w \cdot x + b = 0\).

For nonlinear separation, an appropriate map Φ is needed to transform the n-dimensional vector x into an m-dimensional vector in the space \( R^{m} \). The maximal soft-margin formulation of the SVM then leads to the following primal optimization problem:

$$ \begin{array}{*{20}l} {\mathop {\hbox{min} }\limits_{w,b,\xi } } \hfill & {\frac{1}{2}\left\| w \right\|^{2} + C\sum\limits_{i = 1}^{l} {\xi_{i} } } \hfill \\ {{\text{s}}.{\text{t}}.} \hfill & {y_{i} \left( {w \cdot \varPhi \left( {x_{i} } \right) + b} \right) \ge 1 - \xi_{i} ,} \hfill \\ {} \hfill & {\xi_{i} \ge 0,\quad i = 1, \ldots ,l,} \hfill \\ \end{array} $$
(4)

where C is a penalty parameter and \( \xi_{i} \) are slack variables. In this paper, we use a nonlinear kernel function and set \( \varPhi \left( {x_{i} } \right) \cdot \varPhi \left( {x_{j} } \right) = K\left( {x_{i} ,x_{j} } \right) \). The corresponding convex quadratic programming (dual) problem is:

$$ \begin{array}{ll} {\mathop {{\text{min}}}\limits_{\alpha } } \hfill & {\frac{1}{2}\sum\limits_{{i = 1}}^{l} {\sum\limits_{{j = 1}}^{l} {y_{i} y_{j} \alpha _{i} \alpha _{j} K\left( {x_{i} ,x_{j} } \right)} - \sum\limits_{{j = 1}}^{l} {\alpha _{j} } } } \hfill \\ {{\text{s}}.{\text{t}}.} \hfill & {\sum\limits_{{i = 1}}^{l} {y_{i} \alpha _{i} = 0} ,} \hfill \\ {} \hfill & {0 \le \alpha _{i} \le C,\quad i = 1, \ldots ,l,} \hfill \\ \end{array} $$
(5)

where \( \alpha_{i} \) are Lagrangian multipliers. After obtaining the solution \( \alpha^{*} = \left( {\alpha_{1}^{*} , \ldots ,\alpha_{l}^{*} } \right)^{T} \), the optimal separating hyperplane can be given by:

$$ \begin{aligned} g\left( x \right) & = \sum\limits_{i = 1}^{l} {y_{i} \alpha_{i}^{*} } K\left( {x_{i} ,x} \right) + b^{*} , \\ b^{*} & = y_{j} - \sum\limits_{i = 1}^{l} {y_{i} \alpha_{i}^{*} } K\left( {x_{i} ,x_{j} } \right), \\ \end{aligned} $$
(6)

where \( x_{j} \) is any support vector with \( 0 < \alpha_{j}^{*} < C \). A new sample is then classified as 1 or −1 according to the decision function in formula (2).
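
For reference, the standard soft-margin SVM of formulas (4)–(6) can be trained with an off-the-shelf solver. The snippet below is a minimal sketch assuming scikit-learn, with synthetic toy data and placeholder values of C and the RBF kernel parameter; it is not the experimental setup of this paper.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 20))          # toy data: l = 100 samples, n = 20 features
y_train = np.where(X_train[:, 0] > 0, 1, -1)  # toy labels in {1, -1}
X_test = rng.normal(size=(10, 20))

clf = SVC(C=1.0, kernel="rbf", gamma=0.5)     # placeholder C and RBF parameter
clf.fit(X_train, y_train)

g_values = clf.decision_function(X_test)      # g(x) of formula (6)
y_pred = np.sign(g_values)                    # sgn(g(x)) of formula (2)
```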

2.2 Support vector machine with universum

A universum dataset is a collection of additional samples known to belong to none of the classes, and it carries prior knowledge for classification. The structural risk minimization principle in the standard SVM chooses an appropriate decision function from a set of candidate decision functions F, and it incorporates no prior knowledge about the learning task. The support vector machine with universum constructs a data-dependent structure on the set of admissible functions using universum samples. Compared with defining a distribution explicitly, providing a set of universum samples is a more convenient way to encode prior knowledge for a learning task.

Following Cherkassky and Dai [24], Fig. 1 illustrates the SVM with universum. Since universum samples belong to none of the classes, under the maximal margin criterion they should preferably fall inside the margin borders. In Fig. 1, for example, if the margin widths of the two hyperplanes are the same, Hyperplane II is better than Hyperplane I because a larger number of universum samples fall inside its margin borders. Thus, training an SVM with universum should both apply the maximal soft-margin criterion and maximize the number of universum samples distributed near the hyperplane.

Fig. 1
figure 1

Illustration of SVM with universum

Consider a training set augmented with additional universum samples:

$$ T = \left\{ {\left( {x_{1} ,y_{1} } \right), \ldots ,\left( {x_{l} ,y_{l} } \right)} \right\} \cup \left\{ {x_{1}^{*} , \ldots ,x_{u}^{*} } \right\}, $$
(7)

where \( x_{j}^{*} \in R^{n} ,j = 1, \ldots ,u \) represent universum samples. Since universum samples encode the prior knowledge of the classification task by being encouraged to lie close to the hyperplane g(x) = 0, the primal optimization problem of the maximal soft-margin universum SVM (U-SVM) is:

$$ \begin{aligned} \mathop {\hbox{min} }\limits_{w,b,\xi } &\quad\frac{1}{2}\left\| w \right\|^{2} + C_{t} \sum\limits_{i = 1}^{l} {\xi_{i} } + C_{u} \sum\limits_{s = 1}^{u} {\left( {\psi_{s} + \psi_{s}^{*} } \right)} \hfill \\ {\text{s}}.{\text{t}}.&\quad y_{i} \left( {w \cdot \varPhi \left( {x_{i} } \right) + b} \right) \ge 1 - \xi_{i} , \hfill \\ &\quad - \varepsilon - \psi_{s}^{*} \le w \cdot \varPhi \left( {x_{s}^{*} } \right) + b \le \varepsilon + \psi_{s} , \hfill \\ &\quad \xi_{i} \ge 0,\quad i = 1, \ldots ,l, \hfill \\ &\quad \psi_{s} ,\psi_{s}^{*} \ge 0,\quad s = 1, \ldots ,u, \hfill \\ \end{aligned} $$
(8)

where \( C_{t} \) is a penalty parameter and \( \xi_{i} \) are slack variables for the training samples (positive and negative); \( C_{u} \) is a penalty parameter, \( \psi_{s} ,\psi_{s}^{*} \) are slack variables, and \( \varepsilon \) denotes the ε-insensitive loss for the universum samples. The U-SVM formulation in (8) maximizes the margin between the separating hyperplanes and, at the same time, maximizes the number of universum samples distributed near the hyperplane. In particular, if \( C_{u} = 0 \), formula (8) reduces to a standard SVM. The dual problem of U-SVM can be constructed as:

$$ \begin{aligned} \mathop {\hbox{min} }\limits_{\alpha ,\mu ,\nu } &\quad \frac{1}{2}\sum\limits_{i = 1}^{l} {\sum\limits_{j = 1}^{l} {y_{i} y_{j} \alpha_{i} \alpha_{j} K\left( {x_{i} ,x_{j} } \right)} + \frac{1}{2}\sum\limits_{s = 1}^{u} {\sum\limits_{t = 1}^{u} {\left( {\mu_{s} - \nu_{s} } \right)\left( {\mu_{t} - \nu_{t} } \right)} } K\left( {x_{s}^{*} ,x_{t}^{*} } \right)} \hfill \\ &\quad + \sum\limits_{i = 1}^{l} {\sum\limits_{s = 1}^{u} {\alpha_{i} y_{i} \left( {\mu_{s} - \nu_{s} } \right)} } K\left( {x_{i} ,x_{s}^{*} } \right) - \sum\limits_{i = 1}^{l} {\alpha_{i} + \varepsilon \sum\limits_{s = 1}^{u} {\left( {\mu_{s} + \nu_{s} } \right)} } \hfill \\ {\text{s}}.{\text{t}}.&\quad \sum\limits_{i = 1}^{l} {y_{i} \alpha_{i} + \sum\limits_{s = 1}^{u} {\left( {\mu_{s} - \nu_{s} } \right)} = 0} , \hfill \\ &\quad 0 \le \alpha_{i} \le C_{t} ,\quad i = 1, \ldots ,l, \hfill \\ &\quad 0 \le \mu_{s} ,\;\nu_{s} \le C_{u} ,\quad s = 1, \ldots ,u, \hfill \\ \end{aligned} $$
(9)

where \( \alpha_{i} ,\;\mu_{s} ,\;\nu_{s} \) are Lagrangian multipliers. Here, we again choose an appropriate kernel function \( K\left( {x_{i} ,x_{j} } \right) \). Solving formula (9) gives \( \alpha^{*} = \left( {\alpha_{1}^{*} , \ldots ,\alpha_{l}^{*} } \right)^{\text{T}} ,\;\;\mu^{*} = \left( {\mu_{1}^{*} , \ldots ,\mu_{u}^{*} } \right)^{\text{T}} ,\;\;\nu^{*} = \left( {\nu_{1}^{*} , \ldots ,\nu_{u}^{*} } \right)^{\text{T}} \). The optimal separating hyperplane, which incorporates the prior knowledge of the universum, is then given by:

$$ \begin{aligned} g\left( x \right) & = \sum\limits_{i = 1}^{l} {y_{i} \alpha_{i}^{*} } K\left( {x_{i} ,x} \right) - \sum\limits_{s = 1}^{u} {\left( {\nu_{s}^{*} - \mu_{s}^{*} } \right)K\left( {x_{s}^{*} ,x} \right)} + b^{*} , \\ b^{*} & = y_{j} - \sum\limits_{i = 1}^{l} {y_{i} \alpha_{i}^{*} } K\left( {x_{i} ,x_{j} } \right) + \sum\limits_{s = 1}^{u} {\left( {\nu_{s}^{*} - \mu_{s}^{*} } \right)K\left( {x_{s}^{*} ,x_{j} } \right)} . \\ \end{aligned} $$
(10)
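
Here, \( x_{j} \) in \( b^{*} \) is any support vector with \( 0 < \alpha_{j}^{*} < C_{t} \). For illustration, the dual problem (9) and the decision function (10) can be sketched with a generic convex solver. The snippet below is a minimal sketch assuming the cvxpy and scikit-learn packages, with an RBF kernel and placeholder parameter values; it is not the implementation used in this paper.

```python
import numpy as np
import cvxpy as cp
from sklearn.metrics.pairwise import rbf_kernel


def usvm_fit(X, y, X_u, C_t=1.0, C_u=0.1, eps=0.1, gamma=0.5):
    """Solve the U-SVM dual of formula (9) with an RBF kernel (illustrative sketch).

    X : (l, n) labeled training samples, y : labels in {1, -1},
    X_u : (u, n) universum samples. Returns the multipliers and the offset b of (10).
    """
    l, u = len(y), len(X_u)
    Z = np.vstack([X, X_u])
    # Joint Gram matrix over training and universum points; a small jitter keeps it PSD.
    K = rbf_kernel(Z, Z, gamma=gamma) + 1e-8 * np.eye(l + u)

    alpha = cp.Variable(l, nonneg=True)
    mu = cp.Variable(u, nonneg=True)
    nu = cp.Variable(u, nonneg=True)
    # Stack y_i * alpha_i and (mu_s - nu_s); the quadratic part of (9) is then a
    # single quadratic form in this combined vector.
    z = cp.hstack([cp.multiply(y, alpha), mu - nu])

    objective = 0.5 * cp.quad_form(z, K) - cp.sum(alpha) + eps * cp.sum(mu + nu)
    constraints = [cp.sum(z) == 0, alpha <= C_t, mu <= C_u, nu <= C_u]
    cp.Problem(cp.Minimize(objective), constraints).solve()

    a, m, v = alpha.value, mu.value, nu.value
    # Offset b of formula (10), evaluated at a free support vector (0 < alpha_j < C_t).
    j = int(np.argmax((a > 1e-6) & (a < C_t - 1e-6)))
    k_tj = rbf_kernel(X, X[j:j + 1], gamma=gamma).ravel()
    k_uj = rbf_kernel(X_u, X[j:j + 1], gamma=gamma).ravel()
    b = y[j] - (y * a) @ k_tj + (v - m) @ k_uj
    return a, m, v, b


def usvm_decision(X_new, X, y, X_u, a, m, v, b, gamma=0.5):
    """Evaluate g(x) of formula (10) for new samples."""
    k_t = rbf_kernel(X, X_new, gamma=gamma)
    k_u = rbf_kernel(X_u, X_new, gamma=gamma)
    return (y * a) @ k_t - (v - m) @ k_u + b
```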

3 Data and preprocessing

3.1 Dataset of investor sentiment

The data reflecting individual investor sentiment are collected from online posts on the Eastmoney stock forum, one of the largest stock forums in China. Eastmoney is a financial portal founded in 2004, and its stock forum's visits and posts have reached a substantial scale since 2011. It has become the largest and most influential financial portal, accounting for 43.8% of the total effective visit time among financial portals. In July 2016, Eastmoney ranked second among financial portals according to the China Websites Ranking organized by the Internet Society of China. Considering the level of online discussion activity in different sectors, the energy sector is selected as a representative in this paper. We collect a random sample of 5990 online posts about 24 stocks, the constituent stocks of the CSI 300 energy index.

The 5990 unstructured posts are manually classified into bullish, bearish, and neutral views by financial researchers. To ensure the reliability of the labels, each post is labeled by five financial researchers and the majority opinion is accepted. We define bullish views as positive samples, bearish views as negative samples, and neutral views as neutral samples. As a result, the 5990 samples contain 1010 positive samples, 1212 negative samples, and 3768 neutral samples.

3.2 Text data preprocessing

For classification, the unstructured text reviews must be converted into numerical data for computer processing. We call this process preprocessing. The standard preprocessing steps are data cleaning, text representation, and feature extraction.

3.2.1 Data cleaning

Since investors discuss whatever they want on the stock forum, the text data contain a lot of punctuation, noise, etc., and cannot be used directly for analysis. Therefore, we apply data cleaning to remove punctuation and gibberish. We also apply some preprocessing steps specific to Chinese text, such as word segmentation.

3.2.2 Text representation

Text representation is the technology that converts text information into numerical data. We use the N-gram method based on the vector space model. The vector space model [50] is a text representation method based on, and extending, the bag-of-words model [51], and it is commonly used in preprocessing for text classification. The bag-of-words model assumes that each word is independent and represents a text as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The N-gram method based on the vector space model additionally takes into account that the relation between adjacent words may affect classification. The parameter N is the number of related words in a feature: the occurrence of the N-th word is associated with the N − 1 words in front of it. In this paper, we set N = 1, 2, 3. With the N-gram method based on the vector space model, each text is expressed as a vector in the language space, and each feature is weighted according to its importance in the text.

For feature weighting, Salton et al. [52] suggested that the importance of a word can be reflected by the Boolean, frequency, or TF–IDF method. The Boolean method simply distinguishes whether a word appears in the text or not. The frequency method counts the occurrences of the word and therefore yields a continuous weight. TF–IDF multiplies the term (word) frequency by the inverse document frequency, under the assumption that a word that appears often in a given text but seldom in the whole corpus is important for classification. In this paper, we use all three weighting methods.

3.2.3 Feature extraction

Feature extraction removes features that contribute little to the analysis, which reduces computational complexity and helps avoid overfitting. If the total number of features is large and some features rarely appear in the texts, these infrequent features can be assumed to contribute little to classification and can be ignored. Accordingly, feature extraction in this paper keeps only the features whose occurrence count is no less than a given threshold. The feature dimensions obtained with different minimum occurrence counts after data cleaning and text representation are shown in Table 1. N-gram from 1 to 3 denotes the number of related words in a feature of the vector space model, and Min occurrence from 1 to 5 denotes the minimum occurrence count of the extracted features. When the minimum occurrence count is small, the feature dimension is high, which raises computational complexity; when it is large, the feature dimension is low, which may lose important information for classification. Considering both computational complexity and completeness of information, we set the minimum occurrence count to 3 in this paper.

Table 1 Feature dimensions in different preprocessing methods
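
To make the preprocessing pipeline concrete, the sketch below assumes scikit-learn. The posts list is a toy placeholder standing in for the cleaned, word-segmented Chinese texts, and min_df is used as an approximation of the minimum occurrence threshold (scikit-learn counts document frequency rather than raw occurrences).

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy placeholder for the cleaned, word-segmented posts (tokens separated by spaces)
posts = ["股价 上涨 看好", "利空 消息 下跌", "持股 观望 等待"]

ngram_range = (1, 1)  # 1-gram features; e.g. (2, 2) for 2-gram, (3, 3) for 3-gram

# Boolean weighting: 1 if the feature appears in the post, 0 otherwise
boolean_vec = CountVectorizer(ngram_range=ngram_range, min_df=1, binary=True)
# Frequency weighting: raw occurrence counts
freq_vec = CountVectorizer(ngram_range=ngram_range, min_df=1)
# TF-IDF weighting: term frequency multiplied by inverse document frequency
tfidf_vec = TfidfVectorizer(ngram_range=ngram_range, min_df=1)

X_bool = boolean_vec.fit_transform(posts)   # sparse document-feature matrices
X_freq = freq_vec.fit_transform(posts)
X_tfidf = tfidf_vec.fit_transform(posts)
# On the full dataset, min_df=3 would approximate the minimum occurrence count of 3.
```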

4 Experiment

We use the 5990 preprocessed samples as the training set: 1010 positive samples, 1212 negative samples, and 3768 neutral samples. The neutral samples are used as the universum. The parameter ranges for the grid search are shown in Table 2.

Table 2 Ranges of parameters

With fivefold cross-validation, we compare the accuracy of SVM and U-SVM under different preprocessing methods. The accuracy used to evaluate the classifiers is defined as:

$$ {\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}, $$
(11)

where TP is the number of true-positive predictions, TN true-negative, FP false-positive, and FN false-negative. Thus, the numerator of formula (11) is the number of correct predictions, and the denominator is the total number of predicted records. The results are shown in Table 3. TF–IDF, Frequency, and Boolean in Table 3 refer to the text representation methods described in Sect. 3.

Table 3 Accuracy of SVM and U-SVM

In Table 3, bold values indicate cases where the accuracy of U-SVM is higher than that of SVM, and bold italic values indicate cases where TF–IDF yields higher accuracy than the Frequency and Boolean methods.
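
As an outline of how this comparison is run, the snippet below sketches a fivefold cross-validated grid search for the standard SVM, assuming scikit-learn. The data and parameter grid are placeholders rather than the actual feature matrices and the ranges of Table 2; for U-SVM the same loop would be wrapped around a universum-aware solver such as the sketch in Sect. 2.2.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 50))          # placeholder for the TF-IDF feature matrix
y_train = np.where(X_train[:, 0] > 0, 1, -1)  # placeholder bullish/bearish labels

# Placeholder grid; the actual search ranges are those listed in Table 2.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
grid = GridSearchCV(SVC(kernel="rbf"), param_grid,
                    scoring="accuracy", cv=5)  # fivefold cross-validation
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```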

U-SVM performs better than SVM in all cases. In particular, among the three text representation methods, TF–IDF performs noticeably better than the other two. We therefore use the TF–IDF representation in the further discussion. To analyze the results visually, we follow Cherkassky and Dai [24] and generate histograms of the projections of the samples onto the normal direction of the hyperplane for both SVM and U-SVM. We first compute g(x) in formula (6) for the training samples of the standard SVM, and g(x) in formula (10) for both the training samples and the universum samples of U-SVM, which gives the projections onto the normal direction of the hyperplane. We then build the histogram by dividing the range of projections into about 15 bins of width 0.2. The histograms for the 1-gram samples with TF–IDF are shown in Fig. 2 as an example.

Fig. 2
figure 2

Histogram of projections onto normal direction of hyperplane. a SVM, b U-SVM

Figure 2 illustrates the effect of the universum on classification. With the standard SVM (no universum), the samples are not well separated, and samples near the region where the two curves overlap will be misclassified. With U-SVM, the positive and negative samples are separated more clearly. This indicates better performance of the U-SVM method and corroborates the theoretical result that a data-dependent structure on the set of admissible functions built from universum samples can improve learning performance. In addition, the universum samples are densely distributed near the hyperplane, which makes it convenient to extend the problem to three classes in order to identify neutral views in out-of-sample prediction, as discussed below.
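
The projection histograms of Fig. 2 can be reproduced in outline as follows. This is a minimal sketch assuming matplotlib; the g values are the decision values of formula (6) for the standard SVM or of formula (10) for U-SVM (for example, from the usvm_decision sketch in Sect. 2.2).

```python
import numpy as np
import matplotlib.pyplot as plt


def projection_histogram(g_pos, g_neg, g_univ=None, bin_width=0.2):
    """Histogram of projections g(x) onto the normal direction of the hyperplane.

    g_pos / g_neg : decision values of positive / negative training samples;
    g_univ : decision values of the universum samples (U-SVM only).
    """
    values = [g_pos, g_neg] + ([g_univ] if g_univ is not None else [])
    lo = min(np.min(v) for v in values)
    hi = max(np.max(v) for v in values)
    bins = np.arange(np.floor(lo / bin_width) * bin_width,
                     np.ceil(hi / bin_width) * bin_width + bin_width, bin_width)
    plt.hist(g_pos, bins=bins, alpha=0.5, label="positive")
    plt.hist(g_neg, bins=bins, alpha=0.5, label="negative")
    if g_univ is not None:
        plt.hist(g_univ, bins=bins, alpha=0.5, label="universum")
    plt.axvline(0.0, linestyle="--")  # separating hyperplane g(x) = 0
    plt.xlabel("g(x)")
    plt.legend()
    plt.show()
```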

For a more detailed analysis, we sample training sets of size 50, 100, 200, 500, and 1000 from the original dataset and compare the accuracy for these training subset sizes under different universum sample sizes. The universum sizes are 0, 200, 500, and 1000; in particular, when the universum size is 0, the model is a standard SVM. The results for the 1-gram samples with TF–IDF are shown in Table 4. The percentages in brackets are the differences in accuracy compared with the row above. We also test the other datasets with different text representation methods and obtain similar results.

Table 4 Accuracy for different training subset sizes

The results in Table 4 show that U-SVM outperforms SVM for all five training sample sizes. The advantage of U-SVM is most pronounced when the training size is small: for training sizes of 50 and 100, the accuracy of U-SVM with the three universum sizes is at least 3.92% and 7.02% higher, respectively, than that of the standard SVM. This is mainly because the information provided by the universum samples compensates for the lack of training data, confirming that universum samples, which belong to none of the classes, carry prior knowledge for classification. Moreover, the accuracies for different universum sizes suggest that the advantage provided by the universum does not grow indefinitely as the number of universum samples increases. For the training size of 1000, for example, the accuracy with 1000 universum samples is lower than with 500 universum samples (−0.41%). A possible explanation is that since the universum follows a certain distribution, once the number of universum samples reaches a certain level, additional universum samples provide almost no new information. Thus, the universum sample size should balance accuracy and redundancy.

5 Further discussion

We have observed that U-SVM outperforms the standard SVM in the binary classification problem of dividing investor sentiment into bullish (positive) and bearish (negative). However, in out-of-sample prediction there are a large number of neutral views that can be identified as neither positive nor negative. The experiments above define neutral samples as universum, which are used only in training and are not classified in prediction. To identify investor sentiment properly in financial applications, however, we need not only to identify positive and negative samples but also to separate out the neutral ones. Therefore, in this section we discuss a 3-class problem to identify neutral views in out-of-sample prediction. We first construct the empirical separating hyperplanes of U-SVM for 3-class classification and then analyze their effectiveness.

5.1 Empirical separating hyperplane of U-SVM for 3-class classification

In formula (10), g(x) is a real-valued function that assigns a value to each input x for classification. In binary classification, the separating hyperplane is usually set as g(x) = 0: if g(x) > 0, the sample is classified as positive; if g(x) < 0, as negative. One way to separate out neutral samples in prediction is to expand the separating hyperplane, that is, to introduce a parameter a and set the separating hyperplanes as g(x) = ±a. A sample is classified as positive if g(x) > a, as negative if g(x) < −a, and as neutral otherwise.

The idea of expanding the separating hyperplane is illustrated in Fig. 3. The two solid lines in Fig. 3 represent the expanded separating hyperplanes g(x) = ±a, and with an appropriate parameter a the universum data can be separated via −a < g(x) < a. The value of a is commonly set to 1 (the dashed line labeled Hyperplane in Fig. 3), because one of the training objectives is to maximize the number of universum samples falling inside the margin borders −1 < g(x) < 1. However, since the projections of the universum samples onto the normal direction of the hyperplane g(x) = 0 are densely distributed near the hyperplane (Fig. 2b), and the soft-margin formulation allows some positive and negative samples to lie beyond their margins, −1 < g(x) < 1 may not be the best choice for identifying neutral samples. Instead, neutral views may be better classified with a different value of a in a given dataset. We therefore determine an appropriate value of a empirically for a given classification problem. The accuracy of this 3-class classification for different values of a is shown in Table 5. The results verify our inference that −1 < g(x) < 1 is not the best decision rule for identifying neutral samples: the best accuracy is obtained at a = 0.2 in all three cases. Thus, g(x) = ±0.2 are the empirical separating hyperplanes of U-SVM for 3-class classification on this dataset.

Fig. 3
figure 3

Illustration of expanding empirical separating hyperplane

Table 5 Accuracy in different separating hyperplanes
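
The expanded decision rule and the empirical selection of a can be sketched as follows. This is a minimal illustration: the candidate thresholds are placeholders, and g_values stand for the U-SVM decision values g(x) of formula (10).

```python
import numpy as np


def three_class_predict(g_values, a):
    """Classify as positive (1) if g(x) > a, negative (-1) if g(x) < -a, else neutral (0)."""
    return np.where(g_values > a, 1, np.where(g_values < -a, -1, 0))


def select_threshold(g_values, labels, candidates=(0.1, 0.2, 0.5, 1.0)):
    """Pick the threshold a with the highest 3-class accuracy on labeled data.

    labels use 1 (bullish), -1 (bearish), 0 (neutral); the candidate list is a
    placeholder, and the paper reports a = 0.2 as the empirical choice for its data.
    """
    accuracies = {a: float(np.mean(three_class_predict(g_values, a) == labels))
                  for a in candidates}
    best_a = max(accuracies, key=accuracies.get)
    return best_a, accuracies
```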

5.2 Performance of U-SVM for 3-class classification

To analyze the effectiveness of U-SVM for 3-class classification with the above empirical separating hyperplanes, we compare its performance with a standard 3-class SVM. The results are shown in Table 6. The accuracy of U-SVM is 70.43%, and that of the standard SVM is 70.02%; in terms of accuracy alone, U-SVM is not obviously better. However, when recognizing investor sentiment in financial studies, misclassifying a sentiment as neutral and misclassifying it as its opposite have different consequences. One way to capture this is the mean absolute error: if we label positive samples as 1, negative samples as −1, and neutral samples as 0, the absolute error of misclassifying a positive sample as negative is 2, whereas misclassifying it as neutral gives an error of 1, and vice versa. We therefore introduce the mean absolute error (MAE) to measure classifier performance:

$$ {\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {y_{i} - \hat{y}_{i} } \right|} , $$
(12)

where \( y_{i} \) is the actual label, \( \hat{y}_{i} \) the predicted label, and n the total number of predicted records. Another consideration is that, after sentiment identification, most studies compose an investor sentiment index for subsequent financial analysis. The sentiment index is often composed as:

$$ M_{t} = \ln \left[ {\frac{{1 + M_{t}^{\text{BUY}} }}{{1 + M_{t}^{\text{SELL}} }}} \right], $$
(13)

where \( M_{t}^{\text{BUY}} \) is the total number of bullish posts (positive samples) in time interval t and \( M_{t}^{\text{SELL}} \) is the total number of bearish posts (negative samples). Neutral posts play a relatively weak role in composing the sentiment index in formula (13). Thus, compared with misclassifying a sentiment as its opposite, misclassifying it as neutral is relatively acceptable.
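
Formulas (12) and (13) translate directly into code. The short sketch below assumes NumPy and uses the label convention 1 (bullish), −1 (bearish), 0 (neutral); the function names are illustrative.

```python
import numpy as np


def mean_absolute_error(y_true, y_pred):
    """MAE of formula (12) with labels 1 (bullish), -1 (bearish), 0 (neutral)."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))


def sentiment_index(n_buy, n_sell):
    """Sentiment index M_t of formula (13) for one time interval t."""
    return float(np.log((1 + n_buy) / (1 + n_sell)))
```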

Table 6 Performance of U-SVM and SVM for 3-class classification

Comparing the predicted and actual labels of both U-SVM and the standard SVM, we find that U-SVM performs better. In Table 6, Estimation denotes the classification results of the machine learning models, and Original label denotes the manually annotated labels described above. The number of negative samples misclassified as positive is 75 for U-SVM versus 119 for the standard 3-class SVM; similarly, 33 positive samples are misclassified as negative by U-SVM, clearly fewer than the 102 of the standard 3-class SVM. Thus, compared with the standard 3-class SVM, most of the misclassified data are assigned to the neutral class by U-SVM. Since misclassifying a sample as its opposite class is more harmful than misclassifying it as neutral, U-SVM performs better than the standard 3-class SVM. In addition, 294 neutral samples are misclassified by U-SVM (106 as negative and 188 as positive), versus 405 by the standard 3-class SVM (178 as negative and 227 as positive). Fewer neutral samples are misclassified by U-SVM, which also causes less damage to the sentiment index composed by formula (13) than the standard SVM. We also compute the mean absolute error of formula (12) for both U-SVM and the standard 3-class SVM and find a significantly lower MAE for U-SVM. This is mainly because the data-dependent structure built from universum samples carries prior information and improves learning performance, so that positive and negative samples are separated more clearly and universum samples are densely distributed near the hyperplane, which reduces the number of misclassified neutral samples in U-SVM. An appropriate value of the parameter a for the empirical separating hyperplane is also important for identifying neutral samples. All of these results indicate better performance of U-SVM in 3-class classification.

6 Conclusions

This paper studies the performance of the universum SVM for investor sentiment identification. Our empirical results show that, for the 2-class classification problem, the SVM with universum outperforms the standard SVM under several text representations of the dataset. The results for different training subset sizes indicate that the universum is especially helpful when the training dataset is small. Moreover, the benefit of the universum does not grow indefinitely with the number of universum samples: once the universum reaches a certain size, additional universum samples provide almost no further accuracy gain. Since financial studies of investor sentiment also need to classify a large number of neutral views, we further discuss a 3-class problem to identify neutral views in out-of-sample prediction. We propose g(x) = ±0.2 as the empirical separating hyperplanes of U-SVM for the 3-class classification task in this paper and compare its effectiveness with a standard 3-class SVM. The accuracy of U-SVM is not obviously higher than that of the standard SVM, but its results are more acceptable: compared with the standard 3-class SVM, most of the misclassified data are assigned to the neutral class, and fewer neutral samples are misclassified by U-SVM. We also compute the mean absolute error for both U-SVM and the standard SVM and find a significantly lower MAE for U-SVM.

Overall, this paper shows better performance of U-SVM for both 2-class and 3-class classification and offers several plausible explanations. However, due to the constraints of the authors' resources, the dataset used in the empirical study comes from a single stock forum, even though several text representation methods are used. In future work, we expect to use richer datasets and to further examine the empirical separating hyperplanes of U-SVM for 3-class classification across different datasets. In addition, the 3-class classification method with U-SVM in this paper is empirical, and we plan to strengthen its theoretical foundation in the future.