1 Introduction

Emerging computer technologies have made it possible to design mechanisms capable of gathering data from Internet sources such as blogs, filtering the data using predefined categories, and identifying the opinions of users by distinguishing between positive and negative responses. This type of sentiment mining has received considerable attention due to the exponential growth of online data with the advent of mobile devices and social network sites. The development of sentiment corpora and machine learning mechanisms is critical in deciphering online postings, which are often unstructured and loosely formatted (Li and Wu 2010).

In addition to manually created wordlists (e.g., ANEW, Affective Norms for English Words; Bradley and Lang 1999), there are many mature sentiment corpora used for natural language processing (NLP) and the comprehension of word sense in English. One state-of-the-art English sentiment corpus, SentiWordNet (Esuli and Sebastiani 2006), employs machine learning to classify words found in WordNet, which has predefined positive or negative connotations of the words and synsets, or groups of cognitive synonyms (Miller 1995). SentiWordNet can be used to identify the polarity of reviews in many domains. By using pre-tagged wordlists and applying a large corpus to extend the lists of positive and negative words, researchers have produced numerous machine learning algorithms capable of identifying sentiment, thereby enabling the extraction of semantic orientation from reviews.

Methods of sentiment classification, which derive from a combination of text-mining and NLP techniques, have been used to identify a given review as positive or negative. Researchers have developed many approaches, particularly in English, to process the sentiments found in opinions from a variety of perspectives. Sentiment analysis in Chinese, however, poses different problems. Chinese differs considerably from Indo-European languages. Every Chinese character has its own associated meaning, and modern Chinese words consist of one to six characters or ideographic meanings (Wu and Tseng 1993). The absence of word boundaries makes it extremely difficult to assign correct part-of-speech tags and to segment sub-sentences or phrases meaningfully during natural language processing. Thus, developing methods to disambiguate Chinese word sense poses numerous challenges. Despite the availability of Chinese sentiment corpora, e.g., the National Taiwan University Sentiment Dictionary (NTUSD) (Ku et al. 2006) and HowNet (Dong and Dong 2006), the difficulty of applying NLP techniques to Chinese compromises accuracy when wordlists are extended. Another approach is to adapt a mature English corpus to Chinese sentiment analysis, but this still requires overcoming the inherent differences between the languages as well as the poor performance of machine translation (Wan 2009). To prevent opinions expressed in Chinese from being misinterpreted by machines, it is necessary to reflect upon the nature of the Chinese language itself and to process opinions directly from the most basic elements used to represent concepts. This study addresses these problems in Chinese sentiment analysis.

Furthermore, this study sought to overcome two additional problems associated with the analysis of sentiment in Chinese. First, although words in manually established wordlists have well-known positive or negative connotations, a number of neutral words, which are not included in wordlists, can imply either positive or negative senses when they modify syntactic features (also called aspects), regardless of whether the features are explicit or implicit (Liu 2010). For example, in “the battery life (feature) of this cellular phone is too short,” the word “short” is neutral in isolation but takes on a negative sense in this context. Second, the sentiment of a word sense is dynamic: the same word can express a totally different sentiment orientation in different contexts (Wu and Wen 2010). For example, contrary to their usual connotations, “悚” (terrifying) and “皮疙瘩” (goose bumps) express positive polarity when they appear in opinions about horror movies.

Zhang et al. (2012) identified product weaknesses from Chinese reviews by using morpheme-based sentiment analysis and relying on similarity calculation in a predefined wordlist, HowNet. This study sought to remove the constraint imposed by the wordlist. We propose a morpheme-based method of feature selection that searches for domain-dependent Chinese compound words directly in the reviews of a large data set without the help of predefined sentiment resources. Because a suitable data set was available, we took the sentiment analysis of movie reviews written in Chinese as the example for demonstrating the superiority of our approach. To assemble sentiment orientation wordlists from the data itself, we collected opinions written about movies from Yahoo! Taiwan and compiled them into a movie opinion corpus containing 127,424 opinions in 18 categories of movies with a total of 4,631,482 words. Treating the star-rating as an indicator of either positive or negative sentiment (Turney 2002), and thus avoiding potential problems with the inference of star-ratings from reviews (Pang and Lee 2005), this study used PMI (point-wise mutual information) to search for phrases that co-occur with and modify morpheme-level features. These phrases were used as signatures of sentiment, from which we built SVM (support vector machine) classifiers. We then compared the effectiveness of the proposed classifier with that of classifiers built using TF-IDF (term frequency, inverse document frequency), NTUSD, and HowNet, with regard to the analysis of opinions of movies of various genres. Finally, we analyzed the sentiment compounds generated by the proposed classifier and compared them with those in the NTUSD and HowNet wordlists.

The remainder of this paper is organized as follows. Previous related works are presented in Section 2. Section 3 introduces the proposed algorithm, dataset collection, and analysis methods. Evaluation and comparison results are presented in Section 4. In Section 5, we draw conclusions and discuss the findings of this study and future work.

2 Related works

2.1 Sentiment analysis

Sentiment analysis can be categorized into phrase-level, sentence-level, and document-level analyses (Pang and Lee 2008). Commonly used Chinese sentiment dictionaries NTUSD (Ku et al. 2006) and HowNet (Dong and Dong 2006) identify polarity as follows. NTUSD uses manually tagged phrases; HowNet determines polarity using its own Chinese common sense knowledge base. Both wordlists can be used to perform phrase-level (Li et al. 2009; Sun et al. 2010) and sentence-level sentiment analysis (Li and Yao 2007; Ku et al. 2008). To expand sentiment wordlists, statistical analysis and pattern matching can be adapted to match words already classified in the wordlists with additional words sharing their sentiment orientation. One statistical method is PMI, which pairs words and compares their co-occurrence. The PMI algorithm is defined as follows:

$$ PMI\left({word}_1,{word}_2\right)={\log}_2\left[\frac{p\left({word}_1\;\&\;{word}_2\right)}{p\left({word}_1\right)\,p\left({word}_2\right)}\right] $$
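As a minimal sketch of the PMI definition above, the following Python function estimates the probabilities from sentence-level co-occurrence counts. The function name `pmi`, the use of the sentence as the co-occurrence unit, and the toy data are assumptions made for illustration only; they are not taken from any of the cited studies.

```python
import math


def pmi(word1, word2, sentences):
    """Estimate PMI(word1, word2) from a list of tokenized sentences.

    Probabilities are relative frequencies of sentences containing the
    word(s); this choice of counting unit is an illustrative assumption.
    """
    n = len(sentences)
    count1 = sum(1 for s in sentences if word1 in s)
    count2 = sum(1 for s in sentences if word2 in s)
    joint = sum(1 for s in sentences if word1 in s and word2 in s)
    if joint == 0 or count1 == 0 or count2 == 0:
        # log2(0) is undefined; report "never co-occur" as negative infinity.
        return float("-inf")
    return math.log2((joint / n) / ((count1 / n) * (count2 / n)))


# Hypothetical toy corpus: "a" and "b" always appear together.
toy = [["a", "b"], ["a", "b"], ["c"], ["d"]]
print(pmi("a", "b", toy))  # 1.0
```

A positive value indicates that the two words co-occur more often than would be expected if they were independent; a negative value indicates the opposite.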

PMI can be modeled with both small and large window sizes. With a smaller window, PMI finds idioms and common phrases; with a larger window, PMI highlights semantic concepts and other broader relationships among words (Church and Hanks 1990). Turney (2002) developed Sentiment Orientation Point-wise Mutual Information (SO-PMI) to calculate the co-occurrence probability of words and phrases in search engines using the NEAR operation. SO-PMI relies on the assumption that a set of web pages can be considered a large corpus. Although the NEAR operation is no longer available in current search engines, a number of researchers have proposed modified versions of SO-PMI (formula presented below) for adaptation to Chinese sentiment analysis (Ye et al. 2006; Feng et al. 2012). Although SO-PMI can analyze predefined Chinese keyword lists with ease, using SO-PMI to determine the sentiment of unknown words can be problematic because the corpus used by the algorithm is insufficient for the extension of sentiment words.

$$ \mathrm{SO}\text{-}\mathrm{PMI}(phrase)={\log}_2\left[\frac{hits\left(phrase\;\mathrm{NEAR}\;excellent\right)\,hits(poor)}{hits\left(phrase\;\mathrm{NEAR}\;poor\right)\,hits(excellent)}\right] $$
(1)

Here, excellent and poor are the reference words for the two sentiment polarities; NEAR is the search engine proximity operation; and hits(query) is the number of hits returned for the query.
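To make Formula 1 concrete, the sketch below computes SO-PMI from four hit counts. The function name `so_pmi`, the `smoothing` parameter (a small constant added to the NEAR counts to avoid division by zero), and the example counts are assumptions introduced for illustration; no search engine is queried.

```python
import math


def so_pmi(hits_near_excellent, hits_near_poor, hits_excellent, hits_poor,
           smoothing=0.01):
    """SO-PMI of a phrase from raw hit counts (Formula 1).

    `smoothing` is a hypothetical small constant added to the NEAR counts
    so that a phrase with zero co-occurrences does not divide by zero.
    """
    numerator = (hits_near_excellent + smoothing) * hits_poor
    denominator = (hits_near_poor + smoothing) * hits_excellent
    return math.log2(numerator / denominator)


# Hypothetical counts: the phrase appears near "excellent" four times as
# often as near "poor", and both reference words are equally frequent.
score = so_pmi(80, 20, 1000, 1000, smoothing=0.0)
print(score)  # 2.0, i.e. a positive orientation
```

A positive score suggests a positive orientation and a negative score a negative one.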

Pattern matching requires Chinese dictionaries, an understanding of grammar details, and natural language processing (NLP) tools to parse sentences into dependency trees capable of isolating sentiment words and their corresponding features. Tan and Zhang (2008) employed a rule-based approach to Chinese sentiment analysis based on the HowNet lexicon and syntactic structures and analyzed 1,021 documents spanning topics in a variety of domains. Their study analyzed the ranking of books, music, and movie reviews from Amazon China. They reported 79.98 % accuracy using an SVM classifier. These methods can be used in conjunction with a thesaurus to enhance the recognition of sentiment words and improve parsing performance (Xu et al. 2011).

Pre-tagged wordlists are considered essential to Chinese sentiment analysis; however, it is also possible to analyze opinions without the use of seed words. Nasukawa and Yi (2003) proposed an NLP-based approach to sentiment analysis for the extraction of sentiment directly from opinions. By applying a syntactic parser and a self-built sentiment lexicon, their prototype extracts the degree of favorability expressed toward the topic of an opinion. Wan (2009) proposed a pure machine learning approach, known as bilingual co-training, that trains on both unlabeled Chinese product reviews and their machine-translated English counterparts. By leveraging various machine translation services to bridge the language gap, the bilingual co-training method can outperform both basic and transductive methods.

2.2 Morphemes in Chinese

Chinese text consists of a linear sequence of non-spaced or equally spaced ideographic characters, which are similar to morphemes in English (Wu and Tseng 1993; Wu and Tseng 1999). According to research on morphological processing, most compound words (compound ideographic characters) reflect the form and meaning of their constituent morphemes (Zhou et al. 1999). Yuen et al. (2004) conducted a pilot study on strongly polarized Chinese words, which are composed of positive morphemes (e.g., 獎 (gift), 勝 (win), 優 (good)) or negative morphemes (e.g., 傷 (hurt), 貪 (greedy), 疑 (doubt)). They performed sentiment analysis on the Linguistic Variations in Chinese Speech Communities (LIVAC) corpus. Their research indicated that sentiment analysis can use the morphemes within each compound to express the sentiment of the compound, and thereby determine the sentiment of the sentence. They claimed that their approach could enhance the effectiveness of sentiment analysis algorithms, even in the absence of a Chinese corpus and without the costs associated with word segmentation. They attributed the efficacy of their method to its focus on morphemes with sentiment meaning, e.g., 幸 (luck), which is a morpheme of 幸運 (lucky). From a linguistic point of view, Ku et al. (2009) examined the morphological structures found in Chinese syntax: compounding, affixation, and conversion. They categorized Chinese compound words into eight morphological types in order to perform sentiment analysis. In an experiment, they searched for the sentiment of words according to morphological type and tested those words for both word-level and sentence-level polarity. Although morphological information is seldom applied either in Chinese opinion extraction or in solving the coverage problems of opinion dictionaries, Ku et al. reported that the adoption of morphological information improves the performance of word polarity detection. Wang et al. (2011) separated sentiment words into static sentiment words (SSWs), i.e., words whose sentiments do not change, and dynamic sentiment words (DSWs), i.e., words whose sentiments change depending on context, and then computed the morphological productivity of sentiment words. Furthermore, Zhang et al. (2012) introduced an expert system called Weakness Finder, which extracts and groups explicit features using a morpheme-based method and a HowNet-based similarity measurement, and identifies and groups implicit features for each aspect using a collocation selection method.

2.3 Comments on literature review

In general, there are three approaches to Chinese sentiment analysis: using and extending pre-defined sentiment keyword lists, using basic natural language processing techniques to extract features and sentiments, and adopting English sentiment analysis resources. Each approach has problems that have not yet been fully solved.

Annotated resources for sentiment classification in Chinese are not abundant, so pre-defined sentiment wordlists do not work well in most cases. The co-training method attempts to adopt English sentiment resources, but the language gap between English and Chinese is not easily eliminated. Applying basic natural language processing techniques directly is probably necessary, but problems in dealing with Chinese words remain.

The morpheme-based method used in text mining might solve the Chinese language processing problem, even in the absence of a Chinese corpus. Zhang et al. (2012) proposed an expert system, Weakness Finder, which analyzes customer reviews on influential web communities using morpheme-based sentiment analysis. However, one should note that they applied morphemes to search for similar concepts in HowNet, which is a predefined wordlist. We argue that if a dataset is large enough to represent the language characteristics of a specific domain, feature compounds and sentiment compounds will frequently co-occur. Our motivation is therefore to propose a morpheme-based sentiment analysis method that extracts domain-dependent Chinese morphemes directly from a large data set without the help of predefined sentiment resources. We used a data set of movie reviews written in Chinese as the example for demonstrating the superiority of our method. We hope that our method can be applied in situations in which suitable predefined sentiment resources in Chinese are not available.

3 The proposed approach

3.1 Overview

This study performed sentiment analysis of Chinese without using any resources for sentiment analysis, considering the dearth of annotated resources for sentiment classification. To find sentiment expressions for a given genre and determine the polarity of the sentiment, we adopted the morpheme-based technique of identifying features, to search for corresponding phrases in blogs that express movie review sentiment. We attempted to identify sentence fragments that express the sentiment of the opinions expressed in the text, and to create Chinese dynamic sentiment lexicons that express different sentiment for different contexts.

3.2 Dataset

Assembling a sentiment orientation wordlist from a dataset requires a large dataset containing compound words. We collected movie reviews from Yahoo! Taiwan and compiled them into a corpus of movie opinions. The corpus contained 127,424 opinions categorized into 18 genres. Each opinion was ranked on a five-star scale. The corpus includes a total of 4,631,482 words (see Appendix I). One movie could belong to one or more genres. The distribution of collected opinions is presented in Fig. 1. The distribution of stars was as follows: one/two stars (22.6 %), three stars (8.6 %), and four/five stars (68.8 %). All movie opinions were first processed by the part-of-speech (P.O.S) tagger SINICA CKIP (Footnote 1), and then stored as a dataset in the web-based hosting service GitHub (Footnote 2).

Fig. 1 Distribution of movie opinions collected from Yahoo! Taiwan

3.3 Morpheme-based features and collocations

3.3.1 Selected morpheme

To extract compound words with meaningful sentiment from movie reviews, the Chinese morphemes must first be selected. The corpus of 4,631,482 words contained 93,871 distinct words. After excluding compounds with no essential meaning for sentiment analysis (e.g., conjunctions and quantifiers), 74,366 words remained. Referring to the properties of Movie in http://schema.org/, we first listed the movie description words whose frequencies ranked in the top 10 %. We then selected the morphemes. Unlike in English, features in Chinese are multi-syllabic compounds of morphemes that express specific meanings. These 74,366 words contained 4,820 possible morphemes. After consulting two experts about movie feature compounds, we identified eight Chinese morphemes that semantically express different features presented in movie reviews and whose frequencies ranked in the top 5 % of the 4,820 morphemes. Table 1 lists the eight morphemes used to search for feature compounds; each morpheme presents a different sense regarding movie features, such as actors, plots, and special effects. Although morpheme roots can be defined for feature compounds in this way, different datasets would produce different results.

Table 1 Selected morphemes for movie features

3.3.2 Using results from NLP tools

Compound words are rooted in morphemes and can be used to express a variety of movie features, such as actors, plots, and special effects. However, understanding the relationship between the meaning of features and part-of-speech tags can be difficult without considering the context in which the words are used. For example, 編劇 has two senses in Chinese, screenwriter (noun) and screen-writing (verb). Therefore, more information is required from the sentence to identify its meaning.

Word segmentation is the necessary first task in Chinese NLP; however, it is difficult to ensure accurate results when NLP tools are applied to the extraction of movie features from reviews written in Chinese. When using NLP tools to merge different segments of compounds, the PMI value is an important indicator in determining whether two adjacent words (or compounds) should be merged to create a single meaningful phrase. To illustrate the problems associated with the segmentation of compounds in Chinese, Table 2 considers the example of 值回票價 (get one’s money’s worth) in two sentences processed by SINICA CKIP.

Table 2 Chinese word segmentation: comparison for 值回票價

In the first sentence analyzed in Table 2, the NLP tool merges compounds into the common Chinese idiom 值回票價, because 值回 (worth) and 票價 (ticket price) are found in adjacent positions with PMI’ = 18.57, which is relatively high. However, in the second sentence, the grammatical analysis performed by the NLP tool suggests merging 電影 (movie) and 票價 as a noun phrase, and it disregards the high value of PMI’ (“值回”, “票價”) because 值回 and 票價 are not adjacent. A Chinese speaker analyzing the sentiment expressed in the compounds in both sentences would easily identify the idiom 值回票價 as a positive sentiment toward the movie being considered. However, computing software has difficulty identifying the sentiment orientation of the compound 值回, which is seldom used and applied only to modify succeeding compounds. Studying sentiment compounds based only on the output of NLP data processing, without calibrating a suitable information granularity, would produce inaccurate results. On the other hand, it would be unacceptable to divide merged compounds or idioms into smaller units without considering their grammatical structures, or to study sentiment compounds alone without considering the features they modify.

3.3.3 Selecting collocations

Turney’s method of analyzing sentiment orientation (described in Section 2.1) has seldom been applicable to Chinese. Furthermore, Chinese NLP tools are unable to identify the correct features and their corresponding sentiment compounds. For example, the word 好 (good, well), as a verb modifier, normally has a positive meaning, such as in 好看 (good-looking, or handsome, tagged as 好看 Vi, an intransitive verb, by SINICA CKIP). Nonetheless, 好 cannot be separated from 看 (look) by tagging tools. By contrast, 好 takes on a negative connotation in some compounds, e.g., 好難看 (quite bad-looking, or quite difficult to look at, tagged as 好 Vi and 難看 Vi by SINICA CKIP). In fact, 好 can even be segmented as a single word with no sense of sentiment at all.

This study adopted Turney’s design in proposing a novel method in which features are combined with collocations (i.e., the corresponding compounds that are used to modify features) to facilitate our understanding of the concepts. Shared concepts can be calculated according to the probability of words co-existing in sentences across a corpus. Turney (2002) suggested using PMI to determine concepts of sentiment orientation shared between extracted phrases and their representative sentiment polarities, which are “excellent” and “poor”. He utilized results from search engines supporting the NEAR operation, which searches for co-existing words within a ten-word window in order to identify synonyms (Turney 2001). Given the polarized sentiments “excellent” and “poor”, the sentiment orientation of an unknown English word can be found by calculating the SO-PMI in Formula 1 within a window size. Unfortunately, as in the example described above (好看 and 好難看), Chinese NLP tools do not provide sufficient support to enable sentiment analysis. Each feature-collocation combination needs to be considered as a joint conceptual unit, whose sentiment orientation is judged according to the sentences in context.

In order to find the corresponding collocations of features, we limited window size to ±10 to select feature compounds, if there were no stop words or end punctuation found within this range. PMI values as low as −2 were permitted because we wanted to extract any shared concepts that could provide clues regarding the sentiments expressed in the opinions within the dataset, even if those concepts were not found in frequently co-existing compounds or common phrases.
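The selection step described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used in the study: the function names, the set of boundary tokens, and the reading that a boundary token truncates the ±10 window are all hypothetical choices, and the PMI scores are supplied by the caller.

```python
def collocation_candidates(tokens, index, window=10,
                           boundaries=frozenset({"。", "！", "？"})):
    """Return tokens within +/-window of tokens[index].

    The scan stops early at a boundary token (assumed here to be end
    punctuation; stop words could be added to `boundaries`).
    """
    left = []
    for i in range(index - 1, max(index - window, 0) - 1, -1):
        if tokens[i] in boundaries:
            break
        left.append(tokens[i])
    right = []
    for i in range(index + 1, min(index + window, len(tokens) - 1) + 1):
        if tokens[i] in boundaries:
            break
        right.append(tokens[i])
    return left[::-1] + right


def select_collocations(scored_candidates, threshold=-2.0):
    """Keep candidates whose PMI with the feature exceeds the threshold."""
    return [word for word, score in scored_candidates if score > threshold]


tokens = ["只有", "特效", "不錯", "。", "劇情", "老梗"]
print(collocation_candidates(tokens, 1))  # ['只有', '不錯']
# Hypothetical PMI scores for the candidates of the feature 特效.
print(select_collocations([("只有", -3.1), ("不錯", 1.5)]))  # ['不錯']
```

The permissive threshold of −2 keeps weakly associated candidates, consistent with the aim of retaining any shared concept that might carry sentiment.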

3.4 SVM classifier and evaluation

3.4.1 Linear SVM

In order to label each feature-collocation combination as expressing positive or negative sentiment, we adopted the SVM model (Vapnik 1995) to learn and classify movie opinions. The idea behind SVM is to find a decision surface over a vector space that separates the data into two classes. This study used a linear SVM model, which considers arbitrary data \( \overrightarrow{x} \) scattered in a separable space, and learns a vector \( \overrightarrow{w} \) and a constant b from a training set of data, allowing the model to find the decision hyperplane, written as follows:

$$ \overrightarrow{w}\bullet \overrightarrow{x}-b=0 $$

Let the training data set \( D=\left\{\left({y}_i,{\overrightarrow{x}}_i\right)\right\} \) be the collected movie opinions, and let \( {y}_i\in \left\{\pm 1\right\} \) be the positive (+1) or negative (−1) classification of \( {\overrightarrow{x}}_i \). The linear SVM problem involves finding the \( \overrightarrow{w} \) and b that minimize the 2-norm of vector \( \overrightarrow{w} \) subject to the following constraints.

$$ \overrightarrow{w}\bullet {\overrightarrow{x}}_i-b\ge +1\quad \mathrm{for}\;{y}_i=+1 $$

$$ \overrightarrow{w}\bullet {\overrightarrow{x}}_i-b\le -1\quad \mathrm{for}\;{y}_i=-1 $$
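The role of the hyperplane and the margin constraints can be illustrated with a short sketch. It does not train an SVM; it only evaluates a given \( \overrightarrow{w} \) and b against labeled points. The function names and the toy values are assumptions made for illustration.

```python
def decision_value(w, x, b):
    """Signed score w . x - b of point x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) - b


def classify(w, x, b):
    """Predict +1 or -1 from the side of the hyperplane on which x lies."""
    return 1 if decision_value(w, x, b) >= 0 else -1


def satisfies_margin(w, b, data):
    """Check both constraints at once: y_i * (w . x_i - b) >= 1 for all i."""
    return all(y * decision_value(w, x, b) >= 1 for y, x in data)


# Hypothetical separable toy data: (label, feature vector).
data = [(+1, [2.0, 0.0]), (-1, [0.0, 2.0])]
w, b = [1.0, -1.0], 0.0
print(satisfies_margin(w, b, data))  # True
print(classify(w, [3.0, 1.0], b))    # 1
```

Multiplying each constraint by its label \( {y}_i \) folds the two inequalities into the single condition checked in `satisfies_margin`.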

Various researchers (Tan and Zhang 2008; Ku et al. 2009; Sun et al. 2010) have reported that SVM classifiers are more accurate than other classifiers, such as naïve Bayes, conditional random fields, and classifiers based on information gain.

3.4.2 Model evaluation

To measure effectiveness, the F1 measure, which combines recall and precision (Van Rijsbergen 1979), is usually recommended for evaluating SVM classifiers. This study evaluated the precision, balanced accuracy, and F1 scores of all the classifiers. The formulae are given below. In these formulae, tp represents true positives (correct results), fp represents false positives (unexpected results), fn represents false negatives (missing results), and tn represents true negatives (correct absence of results). Note that these commonly used formulae contain one additional component here, “# without features”. Previous researchers applying the NTUSD and HowNet wordlists (e.g., Ku et al. 2009) reported effectiveness after excluding the sentences without identifiable features, since the size of their wordlists is constant. However, our method and the TFIDF method extract feature words from the given corpus, so the number of extracted feature words determines the size of the wordlists. Therefore, to compare the methods fairly, the number of sentences without identifiable features is added back to the denominators of Formulae 2, 3, and 4.

$$ Precision\;(Accuracy)=\frac{\#\;of\; tp}{\#\;of\; tp+\#\;of\; fp+\#\;without\; features} $$
(2)
$$ Recall\;(Sensitivity)=\frac{\#\;of\; tp}{\#\;of\; tp+\#\;of\; fn+\#\;without\; features} $$
(3)
$$ Specificity=\frac{\#\;of\; tn}{\#\;of\; fp+\#\;of\; tn+\#\;without\; features} $$
(4)
$$ F1=\frac{2\times Recall\times Precision}{Recall+Precision} $$
(5)
$$ Balanced\; Accuracy=\frac{Sensitivity+Specificity}{2} $$
(6)
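The following sketch shows how the adjusted measures could be computed from confusion counts. The function name and the example counts are hypothetical. Balanced accuracy is computed here as the arithmetic mean of sensitivity and specificity, which is its conventional definition.

```python
def adjusted_metrics(tp, fp, fn, tn, without_features=0):
    """Precision, recall, specificity, F1, and balanced accuracy, with
    sentences lacking identifiable features added to the denominators of
    Formulae 2-4."""

    def ratio(numerator, denominator):
        # Guard against empty denominators on degenerate inputs.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp + without_features)
    recall = ratio(tp, tp + fn + without_features)
    specificity = ratio(tn, fp + tn + without_features)
    f1 = ratio(2 * recall * precision, recall + precision)
    balanced_accuracy = (recall + specificity) / 2
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical counts: 10 sentences had no identifiable features.
scores = adjusted_metrics(tp=40, fp=10, fn=10, tn=30, without_features=10)
print(round(scores["precision"], 4), round(scores["specificity"], 4))
```

Because the extra term appears only in the denominators, a method that fails to find features in many sentences is penalized on every measure.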

3.5 Experiment preparation

3.5.1 Define polarity in our data set

This study collected opinions written about movies from Yahoo! Taiwan as experimental data. These user comments are usually short and include a star ranking between one and five stars. The user rankings were used as the criterion for the implied sentiment orientation of opinions (one to two stars were considered negative; three stars were considered neutral; four to five stars were considered positive). To eliminate the effect of neutral rankings, we built three classifiers in our experiment: (1) a positive (“+”) classifier, which provides a positive sentiment cluster (four to five stars were considered positive; one to three stars were considered non-positive); (2) a negative (“-”) classifier, which provides a negative sentiment cluster (one to two stars were considered negative; three to five stars were considered non-negative); and (3) a positive–negative (“±”) classifier, which provides both positive and negative sentiment clusters (one to two stars were considered negative; four to five stars were considered positive). In the random selection of opinions from the dataset, we attempted to balance the number of opinions in the positive and negative categories. However, discarding neutral (three-star) opinions proved very difficult, particularly when dealing with a large number of opinions.
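The mapping from star rankings to class labels for the three classifiers can be sketched as below. The function name and the string labels are assumptions for illustration; returning `None` for three-star opinions under the “±” classifier reflects that those opinions belong to neither of its two clusters.

```python
def sentiment_label(stars, classifier):
    """Map a 1-5 star ranking to a label for the '+', '-', or '±' classifier."""
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    if classifier == "+":
        return "positive" if stars >= 4 else "non-positive"
    if classifier == "-":
        return "negative" if stars <= 2 else "non-negative"
    if classifier == "±":
        if stars == 3:
            return None  # neutral opinions fall outside both clusters
        return "positive" if stars >= 4 else "negative"
    raise ValueError("classifier must be '+', '-', or '±'")


for stars in (1, 3, 5):
    print(stars, [sentiment_label(stars, c) for c in ("+", "-", "±")])
```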

3.5.2 Define referencing model

This study utilized morpheme-based feature-collocation combinations to assist in the determination of positive/negative sentiment orientation in opinions. For comparison, we also analyzed the dataset using other feature selection methods: TFIDF, and the predefined sentiment keyword lists NTUSD (Ku et al. 2006) and HowNet (Dong and Dong 2006). For HowNet, we used its subset, the HowNet Sentiment Dictionary, known as Senti-HowNet.

TFIDF was implemented using the following formula to calculate whether terms should be designated as frequently appearing and thus be used to determine the positive/negative orientation of sentiments expressed in opinions:

$$ TFIDF\left(t,d\right)= tf\left(t,d\right)\times \log \left(N/{n}_i\right) $$

where tf(t,d) is the number of times that term t occurs in document d, N is the total number of training opinions, and \( {n}_i \) is the number of opinions containing the term t.
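A minimal sketch of this weighting is given below. The function name and the toy corpus are hypothetical, and the natural logarithm is assumed because the formula does not state a base.

```python
import math


def tfidf(term, document, corpus):
    """TFIDF(t, d) = tf(t, d) * log(N / n), with documents as token lists.

    N is the number of documents in `corpus`; n is the number of documents
    containing `term`. The natural logarithm is an assumption.
    """
    tf = document.count(term)
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # a term absent from the corpus carries no weight
    return tf * math.log(len(corpus) / n_containing)


# Hypothetical toy corpus of four tokenized opinions.
corpus = [["a", "a", "b"], ["b"], ["c"], ["b", "c"]]
print(tfidf("a", corpus[0], corpus))  # 2 * ln(4)
print(tfidf("b", corpus[0], corpus))  # 1 * ln(4/3)
```

Terms that occur in nearly every opinion receive a weight close to zero, so they are unlikely to be selected as features.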

Although NTUSD and HowNet distinguish between positive and negative words when a lexicon is applied to sentiment analysis, a number of words were not chunked or segmented in the same manner as in the processing of opinions. For example, the concept element “悲傷” can be found in different forms, such as “悲傷的”, “使悲傷” and “極度悲傷”. To further analyze the results produced by NTUSD and HowNet, we input both sets of results into the SINICA CKIP.

Although NTUSD and HowNet both label the sentiment orientation of each compound, the P.O.S tool occasionally labels compounds with different orientations. Furthermore, in a pilot experiment using 40,000 random opinions, the performance of both keyword lists was disappointing when only a single sentiment orientation was used (see Appendix II). Thus, we disregarded the pre-defined sentiment orientation and combined both positive and negative compounds to build the classifiers used in our experiment. In addition, the HowNet wordlist was originally encoded in simplified Chinese. Before inputting the lists into the P.O.S tool, these words were converted into traditional Chinese by referencing simplified/traditional Chinese conversion tables from Wikipedia (Footnote 3). We eventually employed 6,510 processed features from NTUSD and 7,555 features from HowNet for classifier training in our experiment scenarios. In the application of TFIDF for feature selection, we limited the maximum number of terms to 8,000 in order to extract only the most meaningful compounds for classification.

3.5.3 Pre-processing

Pre-processing involved sending user opinions to the SINICA CKIP for chunking and segmentation into fundamental concept units. Although P.O.S information is available for each compound segment, not all compounds are meaningful or open to interpretation in terms of sentiment orientation. Therefore, we used a tag-list (see Appendix III) to filter out unwanted compounds and considered only those compounds with essential meanings, which are the most likely to be P.O.S tagged as verbs and nouns in opinions.

To extract the correct sentiment orientation implied by opinions, we adopted the idea of negation tagging from Das and Chen (2001). In their work, the English words “not,” “no,” and “never” were deemed negative compounds. In this study, seven Chinese compounds were designated as negative compounds: ‘不’, ‘沒有’, ‘不要’, ‘不能’, ‘沒’, ‘無’, and ‘不會’. Das and Chen assumed that every word between a negation word and the first punctuation mark following the negation word would be affected. Due to fundamental differences in Chinese grammar, we designated the phrase preceding the sentence boundary tag as the area affected by the negative words. In addition, Das and Chen only marked negation as “-”. This study marked each feature compound with either “+” (representing zero or an even number of negative words within a given range) or “-” (representing an odd number of negative words). This resulted in an increase in the total number of compounds.
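The parity-based marking can be sketched as follows. The function name is hypothetical, and the sketch assumes that a compound is affected only by negation words that precede it within the same sub-sentence, which is one reading of the rule described above.

```python
NEGATION_WORDS = {"不", "沒有", "不要", "不能", "沒", "無", "不會"}


def tag_negation(sub_sentence):
    """Append '+' or '-' to each compound of a tokenized sub-sentence.

    A compound receives '-' when an odd number of negation words precede
    it in the sub-sentence and '+' otherwise. Negation words themselves
    are dropped from the output.
    """
    tagged = []
    negations_seen = 0
    for token in sub_sentence:
        if token in NEGATION_WORDS:
            negations_seen += 1
            continue
        tagged.append(token + ("-" if negations_seen % 2 else "+"))
    return tagged


print(tag_negation(["又", "沒有", "重點"]))  # ['又+', '重點-']
```

Under this scheme the same compound can enter the feature set in two forms, for example 重點+ and 重點-, which is why the total number of compounds grows.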

3.5.4 The overall processing procedure

As depicted in Fig. 2, before building any classifiers, we fed dataset D into the SINICA CKIP, removed unwanted words, and added negative/positive markers to obtain tagged-D. For example, the second sentence in dataset D would become “只有 + 特效 + 不錯 + 劇情 + 老梗 + 又 + 重點-”, where “+” is a positive symbol and “-” is a negative symbol. The negation mark in the compound “重點-” (i.e., “focus –”) is due to the phrase “沒有” (i.e., “no” or “none”) in the sub-sentence (“又沒有重點”). From tagged-D, we obtained morpheme-based feature compounds and collocation compounds, i.e., the compounds whose sentiment orientations were still to be judged. We then searched for compounds containing the selected morphemes and for collocations within a window size of 10, filtered out compounds that did not have a PMI value exceeding −2, and constructed the feature set F for the SVM. For example, “只有+”, marked as cc3 in Fig. 2, was filtered out because the value of PMI (“特效+”, “只有+”) is −3.1 and no other PMI containing cc3 has a value greater than −2. Using feature set F, we compiled tagged-D into a bag-of-words (BOW) matrix, in which \( BOW\left[{d}_i,{f}_j\right] \) is marked as 1 if feature \( {f}_j \) exists in sentence \( {d}_i \), and as 0 otherwise. For example, the first sentence “劇情 + 緊湊 + 結局 + 意想不到+” in tagged-D would become “1, 1, 0, 0, … 1, 1, 0, 0,”, because the sentence includes fc1 = “劇情+”, fc2 = “結局+”, cc1 = “緊湊+”, and cc2 = “意想不到+”. Finally, we sent the BOW matrix to the linear SVM to classify the sentiment polarity of compounds and obtain the target results for the positive, negative, and positive–negative classifiers.
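The construction of the binary BOW matrix can be illustrated with the sketch below. The function name and the ordering of the features in F are assumptions; the two tagged sentences are taken from the examples in the text, and treating “緊湊+” as a member of F is likewise an assumption.

```python
def bow_matrix(tagged_sentences, features):
    """Build a binary bag-of-words matrix.

    Row i corresponds to sentence d_i, column j to feature f_j; an entry
    is 1 when the feature occurs in the sentence and 0 otherwise.
    """
    column = {feature: j for j, feature in enumerate(features)}
    matrix = []
    for tokens in tagged_sentences:
        row = [0] * len(features)
        for token in tokens:
            j = column.get(token)
            if j is not None:
                row[j] = 1
        matrix.append(row)
    return matrix


# Hypothetical feature set F (feature compounds first, then collocations).
F = ["劇情+", "結局+", "特效+", "緊湊+", "意想不到+", "不錯+"]
tagged_D = [
    ["劇情+", "緊湊+", "結局+", "意想不到+"],
    ["只有+", "特效+", "不錯+", "劇情+", "老梗+", "又+", "重點-"],
]
for row in bow_matrix(tagged_D, F):
    print(row)
```

Tokens that are not in F, such as “只有+” after PMI filtering, simply leave every column at 0.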

Fig. 2 Processing of data into SVM

3.5.5 Implementation tools

This study used scikit-learn (footnote 4; Pedregosa et al. 2011), a machine learning library for Python, to implement the linear SVM model, and the Natural Language Toolkit (NLTK; footnote 5) (Bird 2006) to calculate the PMI of bigrams.
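To make the PMI quantity concrete, the following plain-Python sketch computes a count-based estimate, PMI(x, y) = log2(c(x, y) · N / (c(x) · c(y))), where c(·) are counts and N is the total number of tokens. It is a stand-in for illustration and does not call NLTK; the function name `pmi_table`, the handling of the window, and the toy input are assumptions, and NLTK's own counting may differ in detail.

```python
import math
from collections import Counter


def pmi_table(sentences, window=2):
    """Return {(x, y): PMI} for ordered token pairs that co-occur within
    `window` tokens of each other inside one sentence.

    window=2 means adjacent tokens only (ordinary bigrams).
    """
    unigram = Counter()
    pair = Counter()
    for tokens in sentences:
        unigram.update(tokens)
        for i, left in enumerate(tokens):
            for right in tokens[i + 1 : i + window]:
                pair[(left, right)] += 1
    total = sum(unigram.values())
    return {
        (x, y): math.log2(n_xy * total / (unigram[x] * unigram[y]))
        for (x, y), n_xy in pair.items()
    }


toy = [["劇情+", "緊湊+"], ["結局+", "意想不到+"]]
print(pmi_table(toy)[("劇情+", "緊湊+")])  # 2.0
```

Pairs that co-occur less often than their individual frequencies would predict receive negative values, which is why a threshold such as −2 removes weakly associated compounds.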

4 Experiment and results

In this section, we report three sets of experimental results. First, without considering genres, we applied the proposed method to build sentiment classifiers for training sets of various sizes and compared its results with those of TFIDF, NTUSD, and HowNet. Second, we compared the results of applying these methods to the dataset while taking genre into account. Third, we compared the compounds obtained using the proposed method with the wordlists in NTUSD and HowNet. In each experiment, 10-fold cross-validation (footnote 6) was used for training models before making predictions for the test set. In the following, “+” denotes a positive classifier, “-” a negative classifier, and “±” a positive–negative classifier; “M” denotes the proposed method (morpheme-based feature-collocation pairs), “N” the application of NTUSD, “H” the application of HowNet, and “T” the application of TFIDF.
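As a reminder of what 10-fold cross-validation involves, the sketch below splits item indices into ten contiguous folds and uses each fold once for validation. It is a generic illustration: the function name is hypothetical, no shuffling or stratification is performed, and the study's actual splitting procedure is not described in this section.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_indices, validation_indices) for each of k folds.

    Fold sizes differ by at most one when n_items is not divisible by k.
    """
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, validation
        start += size


for train, validation in k_fold_indices(25, k=10):
    # Every item is used for validation exactly once across the ten folds.
    assert len(train) + len(validation) == 25
```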

4.1 Opinions of movies

As shown in Fig. 1, the movie rankings posted on Yahoo! Taiwan contain more positive star rankings (four and five stars, 89,357 in total) than negative rankings (one and two stars, 27,209 in total). The dataset is therefore inherently unbalanced. Considering that the total number of negative opinions was less than 22,000, any training set of more than 44,000 opinions would remain unbalanced. In the experiment, the dataset was divided into smaller segments of 5,000, 10,000, 20,000, 30,000, and 40,000 items in order to examine the performance of the classifiers on balanced training sets of different sizes. To test an unbalanced dataset, we used a training set of 120,000 opinions to build classifiers.
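One way to draw a balanced training set of a given size is sketched below. It assumes that "balanced" means equal numbers of positive and negative opinions drawn at random, which the text does not state explicitly; the function name and the toy data are hypothetical.

```python
import random


def balanced_sample(positive, negative, size, seed=0):
    """Draw size // 2 items from each class without replacement."""
    half = size // 2
    if half > min(len(positive), len(negative)):
        raise ValueError("the smaller class cannot supply half of the sample")
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(positive, half), rng.sample(negative, half)


# Toy stand-ins for the positive and negative opinion lists.
positive = [f"pos_{i}" for i in range(100)]
negative = [f"neg_{i}" for i in range(30)]
pos_part, neg_part = balanced_sample(positive, negative, size=40)
```

The size check mirrors the constraint in the text: once half of the requested size exceeds the number of negative opinions, a balanced set can no longer be formed.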

As illustrated in Fig. 3, in the training phase, our morpheme-based method achieved the highest score among the methods (an average F1 of approximately 0.8 in the 10-fold training phase), regardless of whether the dataset was balanced or whether the classifier was positive, negative, or positive–negative. As shown in Fig. 4, in the prediction phase, our method again obtained a higher score than the other methods for all types of classifiers (an average F1 score of 0.91 in the prediction phase). The TFIDF method performed the worst because it selects the most frequently appearing (common) words rather than words with significant sentiment.

Fig. 3 Average F1 score in the training phase

Fig. 4 Average F1 score in the prediction phase

We then compared the Accuracy (see Formula 2) and Balanced Accuracy (see Formula 6) of each of the methods. As shown in Figs. 5 and 6, we were unable to detect significant differences between the results from the “+”, “-”, and “±” classifiers; however, the proposed approach still outperformed the other methods. The average balanced accuracy rates of the classifiers using TFIDF, HowNet, NTUSD, and our morpheme-based method were approximately 0.2, 0.5, 0.6, and 0.7, respectively.
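The difference between the two measures matters on unbalanced data. The sketch below assumes the standard definitions, accuracy = (TP + TN) / (TP + TN + FP + FN) and balanced accuracy = (sensitivity + specificity) / 2, which may differ in notation from Formulas 2 and 6; the confusion-matrix counts are hypothetical.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of the recall on the positive and on the negative class."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical classifier that labels every opinion as positive
# on a test set with 90 positive and 10 negative opinions.
print(accuracy(tp=90, tn=0, fp=10, fn=0))           # 0.9
print(balanced_accuracy(tp=90, tn=0, fp=10, fn=0))  # 0.5
```

A classifier that ignores the minority class can therefore reach a high accuracy while its balanced accuracy stays at chance level, which is why both measures are reported.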

Fig. 5 Accuracy in the prediction phase

Fig. 6 Balanced accuracy in the prediction phase

Figure 7 presents the ratio of sentences without identifiable features for each method (our method, TFIDF, HowNet, and NTUSD). A sentence without identifiable features is one in which none of the features of the given method can be found. As seen in the figure, the ratios for NTUSD and our morpheme-based method are lower than those for TFIDF and HowNet. The relatively high ratio for HowNet is perhaps due to the differences between simplified and traditional Chinese, despite our efforts to translate them. On the other hand, NTUSD is an effective wordlist compiled by Taiwanese students (Ku et al. 2006), and it was shown to fit our test dataset well. Nevertheless, our morpheme-based method has an even lower ratio than the fixed wordlists of HowNet and NTUSD.
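The ratio plotted in Fig. 7 can be computed as in the following sketch, which assumes that sentences are already segmented into token lists and that a method's features are available as a collection of tokens. The function name and the toy data are hypothetical.

```python
def no_feature_ratio(sentences, features):
    """Fraction of sentences that contain none of the given features."""
    if not sentences:
        return 0.0
    feature_set = set(features)
    missing = sum(1 for tokens in sentences if feature_set.isdisjoint(tokens))
    return missing / len(sentences)


test_sentences = [["劇情+", "緊湊+"], ["老梗+"], ["特效+", "不錯+"], ["又+"]]
wordlist = ["劇情+", "不錯+"]
print(no_feature_ratio(test_sentences, wordlist))  # 0.5
```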

Fig. 7 Sentences without identifiable features within the test set

It should be noted that the feature sizes of the NTUSD and HowNet wordlists remained constant, whereas the sizes of the wordlists used in the TFIDF method (maximum size 8,000) and the proposed method (no size limit) depend on the number of feature words extracted from the given training set. Figure 8 reports the number of extracted compounds, consisting of morpheme-based features and collocations. We observed that with a relatively small training set of 5,000 opinions (i.e., a test set size of 111,566), our morpheme-based method used only 1,200 compounds (Fig. 8) to achieve an F1 score of 0.77 (Fig. 3) and a balanced accuracy of 0.7 (Fig. 6).

Fig. 8 Ratio of morpheme-based features and collocations in various training sets

In summary, with training set sizes of 30,000 (test set size 86,566) and 40,000 (test set size 76,566), approximately 4,100 and 5,000 compounds were extracted, respectively. All classifiers (“+”, “-”, “±”) reported an F1 of approximately 0.82 (Fig. 4), a balanced accuracy of approximately 0.8 (Fig. 6), and a ratio of sentences without identifiable features of approximately 2 % in the test sets (Fig. 7). Furthermore, with the unbalanced training set (size 120,000; test set size 31,300), our morpheme-based method extracted approximately 7,700 classifiable compounds and achieved the best results among all methods (average F1 score of 0.79, average accuracy of 0.9, average balanced accuracy of 0.7, and average percentage of sentences without identifiable features of 1.5 %). Compared with the fixed-size wordlists NTUSD and HowNet, the proposed method extracted only slightly more compounds (features and collocations); however, it suited our data sample very well.

4.2 Multiple movie genres

This study compared the performance of these methods when considering multiple movie genres. We first grouped the movie genres listed in Fig. 1 into six categories, according to similarities among genres. These groups were as follows: (1) “A,B” Group was Fantasy and SciFi; (2) “C,F” Group was Crime and Actions; (3) “D,E,H” Group was Drama, Romance/Family, and Love Story; (4) “P,Q,I” Group was Animation, Comedy and Adventure; (5) “K,R” Group was Terror and Mystery/Thriller; (6) “G,J,L,M,N,O” Group was others.

The training data selected from each genre group contained a maximum of 40,000 opinions, and 10-fold validation was used for each classifier in the construction phase. As shown in Fig. 9, TFIDF and HowNet performed virtually the same with regard to average balanced accuracy, while our morpheme-based method outperformed both in every genre group. Furthermore, Figs. 10, 11, and 12 show that, regardless of classifier type (“+”, “-”, “±”), the proposed method processed the test data more effectively than the other methods did.

Fig. 9 Average balanced accuracy in each genre group in training

Fig. 10 Average balanced accuracy in each genre group in prediction

Fig. 11 Ratio of sentences without identifiable features for each genre

Fig. 12 Ratio of morpheme-based features and collocations in genre group training sets

One interesting observation is that Group 5 (Terror and Mystery/Thriller) achieved the best performance and Group 6 (Others) the worst for every method except TFIDF. Further analysis indicated that the sentiment compounds used in genres “K” and “R” overlap to a particularly high degree (the intersection is about 92 % of “K”). This means that users wrote similar compounds to describe movies in Group 5. This was not the case in Group 6, in which the compounds are too diverse. Nevertheless, in all six groups, our method performed better than the other three methods, as shown in Fig. 10 (the red line).
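The overlap figure quoted above (the intersection as a share of genre “K”) corresponds to a simple set computation, sketched here with hypothetical compound sets; the function name is illustrative only.

```python
def overlap_ratio(reference, other):
    """Share of the compounds in `reference` that also occur in `other`."""
    reference, other = set(reference), set(other)
    if not reference:
        return 0.0
    return len(reference & other) / len(reference)


# Toy compound sets standing in for the compounds of two genres.
genre_k = {"驚悚+", "恐怖+", "老梗+", "無聊+"}
genre_r = {"驚悚+", "恐怖+", "老梗+", "懸疑+"}
print(overlap_ratio(genre_k, genre_r))  # 0.75
```

Because the denominator is the size of the reference set, the ratio is not symmetric: the share of “K” found in “R” can differ from the share of “R” found in “K”.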

4.3 Sentiment orientation of compounds

Figure 13 illustrates that only about 30 % of the compounds selected by our morpheme-based method appeared in the NTUSD wordlist and about 10 % appeared in the HowNet wordlist. This is possibly because the sentiment wordlists NTUSD and HowNet were not specifically designed for the analysis of movie reviews. In addition, among the compounds overlapping with NTUSD, the number of negative compounds was significantly higher than that of positive compounds; in the overlap with HowNet, this situation was reversed.

Fig. 13 Selected compounds overlapped with NTUSD and HowNet

Table 3 presents examples of compounds, including Chinese idioms, slang, and popular terms, that were identified using the proposed method but did not appear in NTUSD or HowNet. This suggests that those predefined wordlists are not well suited to the analysis of movie reviews: they are general-purpose resources, not designed for specific domains. By contrast, the proposed method can operate in a variety of domains, such as movie reviews, and identifies sentiment words more accurately than the other methods.

Table 3 SVM weights of selected compounds in movie genre groups

Our results demonstrated the ability of the proposed method to identify compounds whose sentiment orientation differs across genres. For example, the compound “驚悚” (“terrifying”) would carry negative sentiment in common usage and in the sentiment wordlists NTUSD and HowNet. However, in Group 5 (Terror and Mystery/Thriller), our method reported that “驚悚” has a positive value of 0.24 and “不驚悚” (“not terrifying”) has a negative value of −0.40. Naturally, it is necessary for a horror movie to be terrifying; viewers would be disappointed if that were not the case. As another example, in common usage and in NTUSD and HowNet, the word “醜” (“ugly”) is considered negative and is used to describe an appearance as hideous or unsightly. However, the proposed method attributed to it a positive value of 0.04 in Group 1 (Fantasy and SciFi) and a positive value of 0.25 in Group 5 (Terror and Mystery/Thriller). The examples in Fig. 14 show that this term can carry positive connotations in these genres.

Fig. 14 Sample opinions including “醜” (i.e., “ugly”)

5 Conclusions and future research

If the approaches developed for English were adopted directly, sentiment analysis in Chinese would be subject to many forms of bias. This study proposed a morpheme-based method of feature selection that searches for domain-dependent Chinese compound words directly in the reviews of a large dataset, without the help of predefined sentiment resources. Our method uses a POS tagger to segment the texts, filters morpheme-based features, and extracts appropriate collocations with relatively high PMI values to build sentiment classifiers. The results show that the proposed method achieves a higher level of balanced accuracy with a small set of extracted feature and collocation compounds, and provides a higher hit rate for features when new opinions are introduced. The proposed method also maintains this performance across movie genres. Compared with predefined wordlists that assign a single polarity, the proposed method is better able to identify the sentiment of words whose polarity varies with the genre of the movie. This study took only movie reviews as an example. However, the proposed approach is domain-independent and would extract domain-dependent words from any given dataset.

This study was subject to a number of limitations. Our morpheme-based method did not take into account semantics or adverbs of degree (e.g., “very” good). Future research could explore the possibility of introducing semantics and degree of sentiment into our approach. In addition, some products or services have several aspects that can be reviewed. For example, a reviewer may comment on the dishes, environment, and waiter service of a restaurant and give each a different score. Future extensions of the morpheme-based method could explore how to identify these different aspects in the posted opinions. Finally, future research could investigate the possibility of weighting words according to their distance from the target compounds when employing PMI for filtering.