1 Introduction

With the explosion of information on the Internet, it is hard to make decisions based on reviews, tweets, etc. People purchase products on the Internet and immediately express their opinions. These opinions have a significant effect on the financial performance of the companies involved. The main problem in this process is that the opinions are expressed in natural language. There exists a wide gap between opinions in natural language (i.e., unstructured data) and the structured data that most applications require [1].

More than 80% of stored knowledge exists as text, documents, video, and voice media, and in computer science terms these sources are unstructured. In knowledge extraction, the implicit meanings and concepts must be recognized before they can be searched. Idea mining in text is therefore limited by what humans can technically search for. Keywords are the keys used by search engines to find text data, and they are based on the facts presented, not the ideas; expressing ideas through keywords alone is not possible [2].

Sentiment classification (SC) is an appealing field in text mining. The opinions extracted from unstructured data on the Internet are classified as positive, negative, or neutral. In this context, classification is performed at three levels: document, sentence, and feature. According to Pang and Lee, the classes at these three levels are determined separately [3].

Pre-processing contributes greatly to SC. Most of the available studies focus on traditional text classification approaches, where a document is represented both as a bag-of-words (BoW) and through part-of-speech (POS) tagging [4, 5]. It has been shown that POS tags do not provide enough information for natural language processing (NLP) analyses; they add unnecessary complexity, whereas the words themselves are proper indicators for sentiment polarity detection [6]. BoW does not record the relationships among words; therefore, the researchers of this paper have added a multi-objective algorithm to extract better features.

The SC methods in practice are machine learning and lexicon-based [7,8,9,10]. Regarding SC, the machine learning method is adopted in most studies [5]. The advantage of this newly proposed method, as opposed to its counterparts, is its ability to identify non-emotional terms that carry sentiment in a sentence (e.g. rich in the sentence ‘This article is rich.’). The corpus-based [11] and dictionary-based [12,13,14,15] methods are applied in lexicon-based classification, where the routine is to first search for every single word of a sentence in the sentiment lexicon and then to extract the sentiment label of the word, provided that it is present in the lexicon. Assisting customers in choosing a product might be the most popular application of opinion mining. Before purchasing a product from e-shops, it is rational for a new customer to read the comments on the given product to become aware of its properties and compare it with similar products.

The main objective of this article is to propose a framework that decreases the dimensionality of the features in multi-class SC, in which both the multi-objective grey wolf (MOGW) algorithm and a neural network (NN) classifier are applied. The absence of a proper ‘decision-maker’ to control keyword extraction from the comments and to reduce the input data dimensions makes such a framework necessary. The algorithm reads the information from the database, which consists of training and testing sections, and sends it to the pre-processing unit, where decomposing the sentences into words allows the stop words to be deleted. For the most important selected keywords, term frequency (TF), inverse document frequency (IDF), and inverse class frequency (ICF) weights are extracted and applied by the feature extraction unit. The aim is to enhance classification performance through discrete grey wolf optimization rather than other metaheuristic algorithms [16, 17]. The worth of each element in the grey wolf algorithm is determined by applying the three weights and the two objectives outlined in this article. Better results can be obtained by applying two monitoring levels, that is, less domination and more crowding distance. In this framework, the accuracy evaluation parameter is applied in the design of the objective functions to find the best worth. These features form the input structure of the classification unit, where the data are classified and the final model is developed.

1.1 The innovative contributions

The innovations in this article consist of:

  • Combined nature of the framework.

  • A two-objective framework is developed for multi-class SC on movie reviews and Twitter.

  • A discrete multi-objective grey wolf algorithm that selects the most prominent features using two objectives: reducing the error of the Naïve Bayes (NB) classifier and of the K-Nearest Neighbour (KNN) algorithm.

  • The selected features and the input data dimensions are reduced by eliminating some of the tested features through the evolutionary MOGW optimization algorithm.

This article is organized as follows: The literature is reviewed in Sect. 2; the framework is proposed in Sect. 3; experiments and results are presented in Sect. 4; discussion is run in Sect. 5, and the article is concluded in Sect. 6.

2 Literature review

Many studies with different outlooks intend to improve classification performance on the known datasets. These studies vary based on the applied classifiers and the Internet forums studied. Some resources that provide insight into the state-of-the-art literature are listed in Table 1.

Table 1 A briefing of the studies run on SC

Manek et al. [18] presented a combined algorithm including Gini index-based feature selection and an SVM classifier to classify large movie review datasets; the obtained results indicate 92.8% accuracy and an efficient error reduction rate. Zhuang et al. [19] proposed a multi-knowledge-based approach for movie review mining where the focus is on reviews of a specific movie genre. The applied knowledge is a combination of integration-based WordNet, statistical analysis, and movie knowledge. Severyn et al. [20] proposed a method including the following three stages: (1) classifying the polarity, (2) testing the structure with the domain, and (3) focusing on the English and Italian languages. Moreover, the tree kernels method is applied for better feature extraction. The obtained results confirm the efficiency of this method when the review domain is more than 4%, applied even in the case of very low resources. Poria et al. [21] applied a 7-layer convolutional neural network to review all aspects of a text and analyse the sentiments, showing that combining speech patterns with the neural network yields better results, obtaining the highest precision of 92.7%. Chen and Qi [22] analysed user opinions with a supervised method called conditional random field (CRF), a conditional probability distribution on an undirected graph. The data necessary for their article were collected from Yahoo and Flickr. The two contributing factors consist of the product features and user comments. The experimental results indicate that three-fourths of the decisions adopted are influenced by other user opinions, and just one-fourth is influenced by product features. Accuracy results are reported for two purchased products. Chaovalit and Zhou [23] proposed an opinion mining method based on semantic orientation where compound words are applied and then graded. In their article, a comparison is made with machine learning approaches, indicating that the accuracy of machine learning is approximately 85.7%, noticeably better than the semantic orientation methods. Dave et al. [24] analysed the comments of Amazon and C/net users by running two different experiments on the available data; in the first, the count of positive comments in training is considered to be five times the count of negative comments, and in the second, the positives and negatives are assumed equal. Experimental results indicate that the more compound words are used in classification, the more accurate the results. In the articles discussing supervised models, the accuracy percentage is always higher than in the ones discussing opinion mining by semantic orientation. This difference is due to the incompatibility of the training data, which prevents their direct comparison. To test the efficiency of the semantic-oriented method against the machine learning methods, an experiment is run on the same data by implementing the SVM algorithm. Kumar and Jaiswal [25] applied binary grey wolf and moth flame optimization for feature selection to enhance the accuracy of SC. Five baseline classifiers, namely NB, SVM, KNN, multilayer perceptron, and decision tree, are applied on the extracted features. The study was run on tweets in SemEval 2016 and SemEval 2017, obtaining the highest accuracy of 76.5% for SVM with a binary grey wolf optimizer on the SemEval 2016 benchmark dataset. The grey wolf optimizer has also been applied to text clustering: Rashaideh et al. [26] applied the average distance of documents to the cluster centroid as an objective to optimize the distance between clusters of documents in a continuous manner. Evaluation of this method was based on six text document collections selected at random from available public datasets. Kumar and Khorwal [16] proposed a method for feature selection in SC where the Firefly algorithm and SVM are applied as the feature selector and classifier, respectively. Shang et al. [17] applied the particle swarm optimization (PSO) algorithm with three classifiers for feature extraction.

As observed in Table 1, only a few studies on SC provide high classification performance through the grey wolf algorithm on movie and Twitter datasets. The main difference between this study and similar ones is that this framework explores features through the discrete structure of a two-objective grey wolf algorithm based on three weighting mechanisms. Kumar and Jaiswal applied many classifiers in the classification stage [25]. The novelty of our research is applying a two-objective evolutionary algorithm capable of feature selection in a discrete manner; Kumar and Khorwal, and Shang et al. applied only one objective [16, 17]. The better results are obtained by applying two monitoring stages, that is, less domination and more crowding distance. This proposed framework provides better results in terms of accuracy, precision, recall, and f-measure on the three datasets.

3 Proposed framework

The scheme of this framework is shown in Fig. 1, where, first, user comments are put in the request queue and then loaded into the dataset. The stored data are split into training and testing datasets at 70% and 30%, respectively. A portion of the dataset space is allocated to the processed data at each stage. The main stage of this proposed framework is the sentiment processing engine, which consists of pre-processing, selection of important features through the MOGW optimization algorithm, and classification through the multi-layer NN.

Fig. 1
figure 1

The proposed framework

In the pre-processing stage, the user comments are extracted first, and next, Tokenization takes place to split an opinion into a list of sentences, thereafter, a sentence is decomposed into words. Words are first stemmed, then the ‘stop words’ are eliminated. The remaining words are simplified and, finally, the weight of words is calculated. The weighted words are stored in the processed data section of the dataset to be applied in the feature selection stage. In this stage, the processed data of the dataset are read, and the most important words are selected through the MOGW algorithm based on the two mentioned objectives then stored in the processed data. In the final stage, the model of classification is obtained, that is, a multi-layer NN is trained based on the processed data and the training data inputs. The obtained model is evaluated through the test data. The multi-layer NN setup and the final model produced through this proposed framework would be applied in classification. The results are obtained based on the final model and input data to be applied in knowledge management. The variables and notations applied in this article are in Table 2.

Table 2 Variables with their description

The pseudocode of the base structure for this proposed framework is expressed as:

figure a

3.1 Pre-processing stage

3.1.1 Opinions decomposed into sentences and words

Opinions consist of several sentences, so first the set S is formed, which contains the sentences; next, each sentence is analysed to form a set T, which contains the words of that sentence [33, 34]. Sentence tokenization is the process of decomposing an opinion into a list of sentences. When tokenizing a paragraph into sentences, each sentence is a token, and in a similar sense, a word is a token when tokenizing a sentence into words. The extracted sentences are tokenized into words, which are then used in the subsequent stages.

3.1.2 Deleting stop words

Performance speed is an important factor, so the ‘stop words’ must be ignored. Stop words (e.g., a, about, all, am, did, has, have) are common English words that contribute little to recognizing the importance of a text. Eliminating these words accelerates the method. The stop word list used in this proposed method contains 119 words.

3.1.3 Stemming the word

Word stemming allows the frequency of words to be recognized and registered accurately. Stemming converts words into their simplest variant, for example by omitting ed from past tenses, ing from present participles, and s and es from nouns. Running this process accurately improves the precision of recognizing similar words.
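For illustration, the pre-processing chain of Sects. 3.1.1–3.1.3 can be sketched as follows; the tiny stop-word list and the crude suffix-stripping stemmer are simplifications for readability, not the actual 119-word list or stemmer used by the framework.

```python
import re

STOP_WORDS = {"a", "about", "all", "am", "did", "has", "have", "is", "the", "this"}  # excerpt only

def stem(word):
    # Crude suffix stripping standing in for a real stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(opinion):
    sentences = re.split(r"[.!?]+\s*", opinion.strip())              # sentence tokenization
    words = [w.lower() for s in sentences for w in re.findall(r"[A-Za-z]+", s)]
    return [stem(w) for w in words if w not in STOP_WORDS]           # stop-word removal + stemming

print(preprocess("This article is rich. All readers liked the examples!"))
# ['article', 'rich', 'reader', 'lik', 'exampl']
```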

3.1.4 Extraction of the weight through the three mechanisms

The extraction process is an essential factor in this framework; TF, IDF, and ICF mechanisms are implemented and explained [10, 33, 34].

Term frequency: TF(i, d) counts the occurrences of term i in comment (document) d; frequently present terms therefore receive higher TF values.

Inverse document frequency: IDF down-weights words that appear in many comments; it is calculated through Eq. 1:

$${\text{IDF}}_{i} = \log \frac{{{\text{TD}}}}{{{\text{DF}}_{i} }}$$
(1)

where TD is the total count of comments and \({\text{DF}}_{i}\) is the count of comments containing word i.

Inverse class frequency: ICF down-weights words that appear in many classes; it is calculated through Eq. 2:

$${\text{ICF}}_{i} = \log \frac{{{\text{TC}}}}{{{\text{CF}}_{i} }}$$
(2)

where TC is the total count of classes and \({\text{CF}}_{i}\) is the count of classes containing word i.
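These three weights can be computed as in the following sketch, a direct transcription of the TF count, Eq. 1, and Eq. 2 on a toy corpus (the documents and labels are illustrative only):

```python
import math
from collections import Counter

def tf(term, doc):
    """TF(i, d): frequency of term i in document (comment) d."""
    return Counter(doc)[term]

def idf(term, docs):
    """Eq. 1: IDF_i = log(TD / DF_i), TD = number of comments,
    DF_i = number of comments containing term i."""
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df) if df else 0.0

def icf(term, docs, labels):
    """Eq. 2: ICF_i = log(TC / CF_i), TC = number of classes,
    CF_i = number of classes whose comments contain term i."""
    classes = set(labels)
    cf = sum(any(term in d for d, l in zip(docs, labels) if l == c) for c in classes)
    return math.log(len(classes) / cf) if cf else 0.0

docs = [["great", "movie"], ["bad", "movie"], ["great", "plot"]]
labels = ["pos", "neg", "pos"]
print(tf("great", docs[0]), idf("great", docs), icf("great", docs, labels))
```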

3.2 Feature selection using MOGW algorithm

Extraction of the important features through the MOGW algorithm is based on the combination of two objectives, decreasing the error rates of the NB and KNN classifiers, while the NN is applied as the final classifier. The structure of the MOGW optimizer for selecting important features (words) based on the weight of words is listed in the following six stages:

  1. Generating the initial position of the wolves based on the total count of words

  2. Determining the worth of each wolf in the population based on the fitness function

  3. Classifying the wolves into four groups of alpha, beta, delta, and omega

  4. Movement of wolves toward the best wolf of the group and the alpha wolf

  5. Deleting the wolves of low worth

  6. Selecting the best wolf concerning dominance and crowding distance

The need-assessment structure of the MOGW Algorithm for feature selection is expressed through the following pseudocode:

First, the initial population and count of generations are determined; next, a random number between 1 and the count of the words is selected as the count of the initial population. The worth of each member of the population is determined based on the words, their weights, and the error rates of the KNN and NB classifiers (lines 2–6). Considering the gained worth and the secondary worth obtained by applying the crowding distance, the population is sorted, and the alpha, beta, delta, and omega populations are formed (lines 7–9). The main loop of the proposed MOGW algorithm is repeated for the given count of generations (line 10). A new population is formed based on the movement of the wolves toward the best wolf of their group and toward the alpha wolf, known as the new wolves (line 11); this is combined with the previous population (line 12). The worth of each member of this combined population is also determined based on the words, their weights, and the error rates of the KNN and NB classifiers (lines 13–15). As to the gained worth and the secondary worth obtained by applying the crowding distance, the population is sorted and classified (lines 16–18). Members are selected from the first rank of the newly classified population to form the next generation, up to the initial population count (line 19).

figure b
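Since the pseudocode figure cannot be reproduced here, the loop just described can be sketched as follows. The two-objective fitness below is only a toy stand-in for the KNN and NB error rates, the movement operator is simplified to copying single bits from the alpha wolf, and crowding-distance tie-breaking is omitted; the sketch shows the overall control flow, not the authors' implementation.

```python
import random

def toy_objectives(wolf):
    """Placeholder for the two real objectives (KNN error, NB error):
    penalize selecting too many words and deviating from a target ratio."""
    ratio = sum(wolf) / len(wolf)
    return (ratio, abs(ratio - 0.3))

def dominates(a, b):
    """True if objective vector a dominates b (lower error is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def rank(population):
    """Sort wolves by how many others dominate them (0 = first front)."""
    objs = [toy_objectives(w) for w in population]
    counts = [sum(dominates(o, objs[i]) for j, o in enumerate(objs) if j != i)
              for i in range(len(population))]
    order = sorted(range(len(population)), key=lambda i: counts[i])
    return [population[i] for i in order]

def mogw(n_words=15, pop_size=8, generations=30):
    population = [[random.randint(0, 1) for _ in range(n_words)] for _ in range(pop_size)]
    for _ in range(generations):
        ranked = rank(population)
        alpha = ranked[0]
        # Movement: each wolf copies one randomly chosen bit from the alpha wolf.
        moved = []
        for wolf in ranked:
            child = wolf[:]
            i = random.randrange(n_words)
            child[i] = alpha[i]
            moved.append(child)
        # Merge parents and moved wolves, keep the least-dominated half.
        population = rank(population + moved)[:pop_size]
    return rank(population)[0]          # best wolf = selected feature mask

print(mogw())
```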

3.2.1 Initial population

In this structure, each element (wolf) is an array of bits. The initial population is formed on a random basis, and each wolf is a string of 1 s and 0 s. If N is the word count, this array is defined as follows:

$$X = [{\text{bit}}_{1} ,{\text{bit}}_{2} , \ldots ,{\text{bit}}_{N} ]$$

If a value in the array is zero, the corresponding word is ignored; if it is 1, the word is selected, as seen in Table 3.

Table 3 The structure of each element of the initial population

A wolf is represented by several 1 s and 0 s. If the count of the candidate words in all comments is 15, the formed wolf has 15 bits: if a bit is 1, the word related to this index is selected and considered by the selection operator, and if it is 0, the word is ignored. Indices are selected randomly. When indices 3, 6, 8, and 14 are selected, the details of the formed wolf are as shown in Table 4.

Table 4 A sample of formed wolf structure for 15 words
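A minimal sketch of this bit-string encoding follows; the 0-based indexing is an implementation convenience, whereas the text counts positions from 1.

```python
import random

n_words = 15
# Random initial wolf: each bit is 1 (keep the word) or 0 (ignore it).
wolf = [random.randint(0, 1) for _ in range(n_words)]

# Wolf from the worked example, with indices 3, 6, 8, and 14 selected.
example = [1 if i in (3, 6, 8, 14) else 0 for i in range(n_words)]
selected_words = [i for i, bit in enumerate(example) if bit == 1]
print(example)          # [0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1]
print(selected_words)   # [3, 6, 8, 14]
```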

3.2.2 Fitness function using two objectives

The suitability (or profitability) level of the wolves is determined through this function. Here, two objectives, each with its own worth, are considered for each element of the initial population.

3.2.2.1 The first objective: calculating the error through the KNN

The KNN algorithm is applied to calculate the value of the indices based on the data available in the database. KNN is a supervised learning algorithm that can estimate the density function of the training data distribution and classify the test data according to the training samples. KNN is one of the simplest and most common instance-based learning methods.

It is assumed that all samples constitute points in n-dimensional real space, that the neighbours are determined based on the standard Euclidean distance, and that K is the neighbours count. The Euclidean distance is the key factor in finding the neighbours. Each sample x is described by the feature vector of Eq. 3:

$$\langle a_{1} (x),a_{2} (x), \ldots ,a_{n} (x)\rangle$$
(3)

The Euclidean distance between the \(x_{i}\) and \(x_{j}\) samples is obtained through Eq. 4 [35].

$$d(x_{i} ,x_{j} ) = \sqrt {\sum\limits_{r = 1}^{n} {(a_{r} (x_{i} ) - a_{r} (x_{j} ))^{2} } }$$
(4)

A refined version of the algorithm assigns a weight to each of the K neighbourhood samples based on its distance to the test sample, usually in an inverse relation. With this assignment, all samples can be taken into account instead of only the K neighbour samples, albeit at a lower speed. For discrete (classification) targets Eq. 5 is applied, and for continuous targets Eq. 6 [36].

$$\hat{f}(x_{q} ) \leftarrow \mathop {\arg \max }\limits_{v \in V} \sum\limits_{i = 1}^{k} {w_{i} } \delta (v,f(x_{i} ))\quad \quad {\text{where}}\;w_{i} = \frac{1}{{d(x_{q} ,x_{i} )^{2} }}$$
(5)
$$\hat{f}(x_{q} ) \leftarrow \frac{{\sum\limits_{i = 1}^{k} {w_{i} \, f(x_{i} )} }}{{\sum\limits_{i = 1}^{k} {w_{i} } }}\quad \quad {\text{where}}\;w_{i} = \frac{1}{{d(x_{q} ,x_{i} )^{2} }}$$
(6)

Since all features are involved in calculating the distance, even unrelated features influence the result. This contrasts with the Decision Tree method, where only the related features are involved. Assume that every sample is described by twenty features, of which only two are enough for classification; samples that should be close may then lie far apart, so the distance applied in KNN becomes misleading. Applying more weight to the related features is a possible solution. This solution is similar to changing the scales of the axes, lengthening the axes of the related features and shortening those of the unrelated ones. Cross-validation on a set of training data is applied to determine the feature weights. The coefficients \(z_{1} , \ldots ,z_{n}\) are selected to be multiplied by the values along each axis so as to decrease the classification error on the remaining samples. The effect of one or more features is completely ignored when \(z_{j} = 0\).
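Under the stated assumptions (Euclidean distance of Eq. 4 and distance-weighted voting of Eq. 5), the first objective can be sketched as the leave-one-out error of a weighted KNN classifier; the feature vectors and labels are illustrative only.

```python
import math
from collections import defaultdict

def euclid(a, b):
    """Eq. 4: Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, samples, labels, k=3):
    """Eq. 5: distance-weighted vote of the k nearest neighbours."""
    neighbours = sorted(zip(samples, labels), key=lambda s: euclid(query, s[0]))[:k]
    votes = defaultdict(float)
    for x, y in neighbours:
        d = euclid(query, x)
        votes[y] += 1.0 / (d ** 2) if d > 0 else float("inf")
    return max(votes, key=votes.get)

def knn_error(samples, labels, k=3):
    """First objective: fraction of samples misclassified when each is
    predicted from the remaining ones (leave-one-out)."""
    wrong = 0
    for i, (x, y) in enumerate(zip(samples, labels)):
        rest_x = samples[:i] + samples[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        wrong += knn_predict(x, rest_x, rest_y, k) != y
    return wrong / len(samples)

X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = ["neg", "neg", "pos", "pos"]
print(knn_error(X, y, k=3))
```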

3.2.2.2 The second objective: calculating the error through the NB

Since the features are numerical, a Gaussian NB classifier is applied here, where Eq. 7 [37] calculates the probability that column K takes the value A, given that class C holds true:

$$P(K = A|C) = \frac{1}{{\sqrt {2\pi \sigma^{2}_{K = C} } }}{\text{e}}^{{ - \frac{{(A - \mu_{K = C} )^{2} }}{{2\sigma^{2}_{K = C} }}}}$$
(7)

where \(\mu_{K = C}\) is the mean of column K over the rows belonging to class C, and \(\sigma_{K = C}^{2}\) is the corresponding variance; no discretization of the numerical input is required.

The output of this stage is the error rate, from which it can be determined which wolves perform better. The dominant wolves with lower error rates are selected as the population of the next generation. The lower the obtained error, the more efficient the produced model. Equation 8 [37] computes this error as a regression-style loss:

$$S = \sum\limits_{i = 1}^{n} {|y_{i} - f(x_{i} )|}$$
(8)

where \(y_{i}\) is the real output of the main class and \(f(x_{i} )\) is the output computed by classifiers NB and KNN. At this stage, there exist two numbers (two objectives) for each wolf, which are applied in the next stage to determine the best wolf.
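Likewise, the second objective can be sketched as a Gaussian naive Bayes error count: Eq. 7 supplies the per-feature likelihood, the class with the largest product of likelihoods is predicted, and the misclassifications are summed in the spirit of Eq. 8. Class priors and numerical safeguards are simplifying assumptions.

```python
import math
from collections import defaultdict

def gaussian(a, mu, var):
    """Eq. 7: probability of value a in a column whose values,
    restricted to class C, have mean mu and variance var."""
    var = max(var, 1e-9)                      # guard against zero variance
    return math.exp(-((a - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def nb_fit(samples, labels):
    rows_by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        rows_by_class[y].append(x)
    model = {}
    for c, rows in rows_by_class.items():
        means = [sum(col) / len(col) for col in zip(*rows)]
        varis = [sum((v - m) ** 2 for v in col) / len(col)
                 for col, m in zip(zip(*rows), means)]
        model[c] = (means, varis)
    return model

def nb_predict(model, x):
    scores = {c: math.prod(gaussian(a, m, v) for a, m, v in zip(x, mu, var))
              for c, (mu, var) in model.items()}
    return max(scores, key=scores.get)

def nb_error(samples, labels):
    """Eq. 8 in its counting form: number of samples whose predicted
    class differs from the real class."""
    model = nb_fit(samples, labels)
    return sum(nb_predict(model, x) != y for x, y in zip(samples, labels))

X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 1, 1]
print(nb_error(X, y))
```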

3.2.3 Selecting the best wolves (first front)

One of the most important features of the MOGW algorithm is its ability to select the next generation by first determining the Pareto front, i.e., the set of non-dominated solutions. To understand the concept of domination, the following definitions are of concern [38, 39]:

  • Strict domination: Wolf \(P_{2}\) is strictly dominated by wolf \(P_{1}\) if \(P_{1} \prec P_{2}\) in all fitness functions; strict domination is expressed through Eq. 9.

    $$F_{i} (P_{1} ) \prec F_{i} (P_{2} )\quad \quad \forall i = 1 \ldots m$$
    (9)
  • Weak domination: Wolf \(P_{1}\) weakly dominates wolf \(P_{2}\) if \(P_{1}\) is not worse than \(P_{2}\) in any fitness function and is strictly better in at least one. The notation \(P_{1} \le P_{2}\) indicates weak domination, obtained through Eq. 10.

    $$\begin{gathered} F_{i} (P_{1} ) \preceq F_{i} (P_{2} )\quad \quad \quad \quad \forall i = 1 \ldots m \hfill \\ F_{i} (P_{1} ) \prec F_{i} (P_{2} )\quad \quad {\text{for}}\;{\text{at}}\;{\text{least}}\;{\text{one}}\;i \hfill \\ \end{gathered}$$
    (10)
  • Neutral: Wolf \(P_{1}\) is neutral to wolf \(P_{2}\) if \(P_{1} \prec P_{2}\) in some fitness functions and \(P_{2} \prec P_{1}\) in others. The notation \(P_{1} \sim P_{2}\) indicates neutrality.

In the MOGW algorithm structure, elements from the population should be selected for the next generation with either strict or weak domination on other elements. In this context, the strictly dominated elements are selected first, followed by the weak ones. This proposed method provides a matrix to save the output of fitness functions for each wolf; Table 5.

Table 5 The structure of fitness function output for wolves

After tabulating this table, a new square matrix called the ‘domination matrix’ is built, with counts of rows and columns equal to the count of the wolves. If wolf i dominates wolf j, in either a strict or weak sense, the cell at row i and column j is set to 1; otherwise, it is set to 0. The sum of each column is then calculated, and the wolves are sorted according to these sums in ascending order. The wolves with low sum values are the best nominations for the next-generation Pareto front. The wolves with an equal sum of domination are placed in the same group, and the decision for the next generation is drawn according to this domination count. Here, the structure's performance is reviewed in an example where wolves \(P_{1}\) to \(P_{6}\) are considered as the current generation. The details of the value saving matrix for the objective functions are in Table 6.

Table 6 The value saving matrix

Table 6 provides the objective values of the wolves. From the following calculations for wolf \(P_{1}\): \(P_{1}\) is neutral in relation to \(P_{2}\), while it strictly dominates \(P_{3}\), \(P_{4}\), \(P_{5}\), and \(P_{6}\); thus, in the first row, the entry for \(P_{1}\) is 0, for \(P_{2}\) is 0, and for \(P_{3}\) to \(P_{6}\) is 1.

$$\begin{aligned} & P_{1} \sim P_{2} \quad \to \quad \quad \quad 0.05 \succ 0.03\quad \quad \quad 0.01 \prec 0.02 \\ & P_{1} \prec P_{3} \quad \to \quad \quad \quad 0.05 \prec 0.08\quad \quad \quad 0.01 \prec 0.03 \\ & P_{1} \prec P_{4} \quad \to \quad \quad \quad 0.05 \prec 0.07\quad \quad \quad 0.01 \prec 0.06 \\ & P_{1} \prec P_{5} \quad \to \quad \quad \quad 0.05 \prec 0.21\quad \quad \quad 0.01 \prec 0.11 \\ & P_{1} \prec P_{6} \quad \to \quad \quad \quad 0.05 \prec 0.14\quad \quad \quad 0.01 \prec 0.12 \\ \end{aligned}$$

Consequently, the first row of the domination matrix at the first stage appears in Table 7.

Table 7 The sample' s domination matrix at the first stage

As to the following calculation: \(P_{2}\) is neutral in relation to \(P_{1}\), while it strictly dominates \(P_{3}\), \(P_{4}\), \(P_{5}\), and \(P_{6}\), thus, in the second row, domination for \(P_{1}\) is 0, \(P_{2}\) is 0, and for \(P_{3}\) to \(P_{6}\) is 1.

$$\begin{aligned} & P_{2} \prec P_{3} \quad \to \quad \quad \quad 0.03 \prec 0.08\quad \quad \quad 0.02 \prec 0.03 \\ & P_{2} \prec P_{4} \quad \to \quad \quad \quad 0.03 \prec 0.07\quad \quad \quad 0.02 \prec 0.06 \\ & P_{2} \prec P_{5} \quad \to \quad \quad \quad 0.03 \prec 0.21\quad \quad \quad 0.02 \prec 0.11 \\ & P_{2} \prec P_{6} \quad \to \quad \quad \quad 0.03 \prec 0.14\quad \quad \quad 0.02 \prec 0.12 \\ \end{aligned}$$

The details of the modified domination matrix at the second stage are in Table 8.

Table 8 The sample's domination matrix at the second stage

As to the following calculations: \(P_{3}\) is dominated by \(P_{1}\) and \(P_{2}\), is neutral in relation to \(P_{4}\), and strictly dominates \(P_{5}\) and \(P_{6}\). Thus, in the third row, the entry for \(P_{1}\) is 0, \(P_{2}\) is 0, \(P_{3}\) is 0, \(P_{4}\) is 0, and for \(P_{5}\) and \(P_{6}\) is 1.

$$\begin{aligned} & P_{3} \sim P_{4} \quad \to \quad \quad \quad 0.08 \succ 0.07\quad \quad \quad 0.03 \prec 0.06 \\ & P_{3} \prec P_{5} \quad \to \quad \quad \quad 0.08 \prec 0.21\quad \quad \quad 0.03 \prec 0.11 \\ & P_{3} \prec P_{6} \quad \to \quad \quad \quad 0.08 \prec 0.14\quad \quad \quad 0.03 \prec 0.12 \\ \end{aligned}$$

The details of the modified domination matrix at the third stage are in Table 9.

Table 9 The sample's domination matrix at the third stage

As to the following calculations: \(P_{4}\) is dominated by \(P_{1}\) and \(P_{2}\) and is neutral in relation to \(P_{3}\), while it strictly dominates \(P_{5}\) and \(P_{6}\); thus, in the fourth row, the entry for \(P_{1}\) is 0, \(P_{2}\) is 0, \(P_{3}\) is 0, \(P_{4}\) is 0, and for \(P_{5}\) and \(P_{6}\) is 1.

$$\begin{aligned} & P_{4} \prec P_{5} \quad \to \quad \quad 0.07 \prec 0.21\quad \quad 0.06 \prec 0.11 \\ & P_{4} \prec P_{6} \quad \to \quad \quad 0.07 \prec 0.14\quad \quad 0.06 \prec 0.12 \\ \end{aligned}$$

The details of the modified domination matrix at the fourth stage are in Table 10.

Table 10 The sample's domination matrix at the fourth stage

Considering the above calculations for \(P_{5}\) and \(P_{6}\), the details of final domination are in Table 11.

Table 11 The final sample's domination matrix

As observed in Table 11, three groups are generated and sorted based on the obtained counts in ascending order; this is called ‘non-dominated sorting’. Accordingly, the Pareto fronts are expressed as follows:

$$\begin{aligned} & F_{1} = \{ P_{1} ,P_{2} \} \\ & F_{2} = \{ P_{3} ,P_{4} \} \\ & F_{3} = \{ P_{5} ,P_{6} \} \\ \end{aligned}$$
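The worked example can be reproduced with a short sketch of this count-based non-dominated grouping; the objective pairs are the values used in the calculations above.

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(objectives):
    """Group solutions into fronts F1, F2, ... by the number of solutions
    that dominate them (the column sums of the domination matrix)."""
    counts = {name: sum(dominates(other, obj)
                        for other_name, other in objectives.items() if other_name != name)
              for name, obj in objectives.items()}
    fronts = {}
    for name, c in counts.items():
        fronts.setdefault(c, []).append(name)
    return [fronts[c] for c in sorted(fronts)]

wolves = {"P1": (0.05, 0.01), "P2": (0.03, 0.02), "P3": (0.08, 0.03),
          "P4": (0.07, 0.06), "P5": (0.21, 0.11), "P6": (0.14, 0.12)}
print(non_dominated_sort(wolves))
# [['P1', 'P2'], ['P3', 'P4'], ['P5', 'P6']]  -- the fronts F1, F2, F3 above
```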

3.2.4 Crowding distance

One of the main challenges in selecting the next-generation population of wolves is the probability of having wolves of the same group with equal ranking. In the mentioned example, if half of the population must be chosen for the next generation, then \(P_{1}\) and \(P_{2}\) will definitely be selected, while the next wolf must be chosen between \(P_{3}\) and \(P_{4}\). This is a difficult task because they are of equal ranking. For a front of only two points the crowding distance is meaningless, so a wolf is selected randomly between \(P_{3}\) and \(P_{4}\). The manner of converting the population into non-dominated groups is shown in Fig. 2.

Fig. 2
figure 2

Non-dominated sorting

As observed in this figure, the wolves qualified for next-generation nomination are separated by the dotted line. All wolves in group 3 have the same worth; therefore, a normal selection decision is impossible and the crowding distance must be applied. This concept is applied to keep the solutions evenly distributed over a region, so that more optimal solutions are obtained in the next generation. The points with higher crowding distance appear in the next generation, and the crowding distance is calculated separately for each group. The crowding distance for wolf \(p\) is calculated through Eq. 11 [40, 41]:

$${\text{CD}}(p) = \sum\limits_{k = 1}^{t} {\frac{{|f_{k} (p - 1) - f_{k} (p + 1)|}}{{\max (f_{k} ) - \min (f_{k} )}}}$$
(11)

where t is the fitness function count, \(\max (f_{k} )\) is the highest value of function \(f_{k}\), and \(\min (f_{k} )\) is the lowest value of function \(f_{k}\).

For each objective function k, the points with the maximum and minimum values have no neighbours on one side, so the infinite distance value is assigned to them. For the other points \((i = 2,\,3,\,\ldots,\,(n - 1))\), the same equation is applied and the crowding distances are summed around each point. The crowding distance for each member of a group is calculated separately; only the distances between the members of the same group are compared, and each group is sorted in descending order. The wolves with the highest crowding distance values are included in the next generation. The calculation pattern of the crowding distance is shown in Fig. 3.

Fig. 3
figure 3

Crowding distance calculation
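Equation 11, with the convention that the boundary solutions of every objective receive an infinite distance, can be sketched as follows; the three-point front is illustrative.

```python
def crowding_distance(front):
    """front: list of objective tuples belonging to one Pareto front.
    Returns one crowding distance per solution (Eq. 11); the extreme
    solutions of each objective are assigned an infinite distance."""
    n = len(front)
    distance = [0.0] * n
    n_objectives = len(front[0])
    for k in range(n_objectives):
        order = sorted(range(n), key=lambda i: front[i][k])
        f_min, f_max = front[order[0]][k], front[order[-1]][k]
        distance[order[0]] = distance[order[-1]] = float("inf")
        if f_max == f_min:
            continue
        for pos in range(1, n - 1):
            i = order[pos]
            gap = front[order[pos + 1]][k] - front[order[pos - 1]][k]
            distance[i] += gap / (f_max - f_min)
    return distance

# Example: the second front of the worked example plus one midpoint.
print(crowding_distance([(0.08, 0.03), (0.07, 0.06), (0.075, 0.045)]))
# [inf, inf, 2.0] -- the interior point gets a finite, summed distance
```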

3.2.5 Alpha, beta, delta, and omega group formations

One of the most outstanding features of the grey wolf algorithm is its memetic property: each generation consists of four groups that are assessed separately. Elements of the groups tend to move toward the optimum based on the worth of the group and of the alpha. To improve this property of the algorithm, a discrete structure is applied for the movement of the wolves. This stage requires that the population of wolves be sorted in descending order of worth. There exist n wolves in the population, where each wolf (element) is assigned an array structure with a length equal to the count of the extracted words. The population is divided into four (i.e., each group consists of \(n/4\) wolves). Following this process, the best member of each group is called \(X - B\).

3.2.6 The movement of the weak wolves toward the best wolf

At this stage in the evolutionary process, the location of a wolf is modified to approach the best wolf. If a wolf with less worth moves toward the location of the best wolf but no new solution is obtained, this movement is inefficient in practice. Consequently, a discrete structure is applied for this movement. In each group, the movement is oriented toward the optimal wolf. Considering that \(X - B\) (the best wolf) and \(X - P\) (any member of the group) are of an array structure, this movement is subject to the following procedure:

A random value between 0 and the count of words (the upper bound UB) is assigned to \(S - Max\). If the ith index of \(X - B\) > the ith index of \(X - P\), a positive mutation occurs and the ith step element is obtained through Eq. 12:

$${\text{Step}}_{i} = \min ({\text{int}} [{\text{rand}}(1,X\_B_{i} )],S\_Max)$$
(12)

If the ith index of \(X - B\) < the ith index of \(X - P\), the opposite holds and the step is obtained through Eq. 13:

$${\text{Step}}_{i} = \max ( - 1 \times {\text{int}} [{\text{rand}}(1,X\_P_{i} - X\_B_{i} )], - S\_{\text{Max}})$$
(13)

To calculate any one of the new \(X - P\) indices, Eq. 14 is applied.

$$X\_P_{i} = (X\_P_{i} + {\text{Step}}_{i} )$$
(14)
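A literal transcription of Eqs. 12–14 is sketched below, assuming that rand(a, b) draws a uniform value between a and b and that no step is taken when the two indices are equal (a case the equations leave open). For bit-valued wolves this simply snaps the weak wolf's indices to those of the best wolf.

```python
import random

S_MAX = 5   # illustrative cap on the step size (the UB-based S-Max of the text)

def discrete_step(x_p_i, x_b_i, s_max=S_MAX):
    """One index of the discrete movement of Eqs. 12-14."""
    if x_b_i > x_p_i:                                         # Eq. 12
        step = min(int(random.uniform(1, x_b_i)), s_max)
    elif x_b_i < x_p_i:                                       # Eq. 13
        step = max(-int(random.uniform(1, x_p_i - x_b_i)), -s_max)
    else:
        step = 0
    return x_p_i + step                                       # Eq. 14

best_wolf = [1, 0, 1, 1, 0]
weak_wolf = [0, 1, 1, 0, 0]
print([discrete_step(p, b) for p, b in zip(weak_wolf, best_wolf)])
# [1, 0, 1, 1, 0] -- for bit arrays the weak wolf moves onto the best wolf
```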

3.2.7 Wolves reassessment

The new wolves are generated according to their mutation and all new wolves should be reassessed to determine their worth.

3.2.8 Selecting the best wolf

Termination of the multi-objective grey wolf algorithm depends on the count of generations: after the final generation, the wolf with the highest worth is selected and sent to the classification unit as the best set of selected words.

3.3 Classification through NN

The NN is the final classifier in this framework, where its input and output layers, hidden layers, and activation functions are the most important factors [36]. For example, if three features are extracted in the previous stage and the classification structure has two classes, then the NN structure is illustrated in Fig. 4.

Fig. 4
figure 4

The NN structure consisting of 3 inputs and two classes

Note that these stages and classes are subject to the best wolf features.

  • Features mapping as the input layer

The best element (wolf) of the feature selection stage is obtained by the grey wolf algorithm and determines the NN input. This means that the neuron count of this layer is equal to the count of features extracted by the best wolf in the feature selection stage.

  • The middle layer

The neurons count in this layer is equal to the neurons count of the input layer, plus one.

  • Mapping of the class count as the output layer

Here, the classification structure is two-class or three-class, so the output layer has 2 or 3 neurons, depending on whether the data of the proposed method have two or three classes. Each neuron takes the value 0 or 1: the output of the first neuron is 1 and the second 0 if the NN recognizes the features as belonging to class 1 (a small sketch of this structure follows below).
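The structure of Fig. 4 can be sketched with a plain forward pass (3 inputs, 3 + 1 hidden neurons, 2 output neurons); the random weights, sigmoid activation, and winner-takes-all reading of the outputs are illustrative assumptions rather than the trained model.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer with sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

n_in, n_hidden, n_out = 3, 3 + 1, 2        # inputs = selected features, hidden = inputs + 1
random.seed(0)
w1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [0.0] * n_hidden
w2 = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [0.0] * n_out

features = [0.42, 0.10, 0.77]              # weighted values of the selected words
hidden = layer(features, w1, b1)
output = layer(hidden, w2, b2)
predicted_class = output.index(max(output))   # neuron with output closest to 1 wins
print(output, predicted_class)
```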

4 Experiments and results

MATLAB S/W is applied to evaluate the efficiency of this proposed framework. The K-fold technique with tenfold cross-validation is applied to improve the accuracy of the evaluation.

4.1 Test environment and evaluation parameters

Windows 10 is the test environment, and the specifications are tabulated in Table 12.

Table 12 Testing environment specifications

The four important metrics applied in classification evaluation in this article are accuracy, precision, recall, and f-measure. Accuracy is the number of correctly classified instances relative to the total number of classified instances. The precision and recall are calculated separately for all classes and then averaged. The f-measure is the weighted harmonic mean of precision and recall. In addition, the Mean Square Error (MSE), Sum of Squares for Error (SSE), and Determination Coefficient (\(R^{2}\)) error indices are reported. In \(R^{2}\), y is the actual value, and \(\overline{{y_{i} }}\) is the predicted value. All evaluation metrics are tabulated in Table 13 [37].

Table 13 Evaluation parameters
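The metrics of Table 13 can be computed as in the following sketch; precision and recall are macro-averaged over the classes as described above, and the conventional definition of the determination coefficient is assumed.

```python
def macro_metrics(y_true, y_pred):
    classes = sorted(set(y_true))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(classes)
    recall = sum(recalls) / len(classes)
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f_measure

def error_indices(y_true, y_pred):
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mse = sse / len(y_true)
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - sse / ss_tot if ss_tot else 0.0
    return sse, mse, r2

print(macro_metrics(["pos", "neg", "neu", "pos"], ["pos", "neg", "pos", "pos"]))
print(error_indices([1, 0, 2, 1], [1, 0, 1, 1]))
```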

4.2 The used datasets

The following three datasets are applied to evaluate the performance of this proposed framework; see Table 14.

  1. Cornell movie review [27]: the polarity movie dataset (PMD) of 1000 positive and 1000 negative reviews extracted from IMDB, one of the most widely used datasets in sentiment analysis.

  2. User comments on other users' tweets on Twitter: the Sentiment140 dataset, obtained from 127,000 comments by 73,100 users. Here, the comments are grouped as positive, negative, and neutral. This text corpus structure is known as TS3.

  3. User comments on other users' tweets on Twitter: the Sentiment140-2 dataset, obtained from 74,000 comments by 31,820 users. The comments are grouped as positive and negative, and this text corpus structure is known as TS2. The data are provided from Stanford University data [42].

Table 14 A sample of the structure of two records in datasets

4.3 The comparison

The closest methods to this framework are FFSVM [16] and FS-BPSO [17], in which the Firefly and PSO metaheuristic algorithms are applied, respectively. The reason for choosing these methods is their close correspondence with this framework. The feature selection stage of each of the two methods is explained separately. The main parameters of the implemented algorithms are in Table 15, and the total feature counts of the best elements of the three algorithms are tabulated in Table 16.

Table 15 Main parameters of implemented algorithms
Table 16 The total features' count of the best elements

FFSVM: The mutual information criterion is applied to extract features, and feature selection is accomplished through the discrete Firefly algorithm. Each element of the initial population is generated as a binary array of bits with a length equal to the feature count, and the population size is defined. The initial positions are generated randomly as numbers within 0–1. For each element in the population, the fitness function is calculated based on the accuracy of the binary SVM classifier. The fireflies of less worth move toward fireflies of more worth. For updating the position, the researchers converted the optimization to a discrete form, and the new positions are calculated based on this change. The best firefly is selected based on the fitness function. Finally, the best position is saved, and the other positions of less worth are deleted. This process terminates when the obtained generation equals the maximum generation [16].

FS-BPSO: The mutual information criterion is applied to evaluate features, and feature selection for binary SC is accomplished through a binary PSO algorithm. The initial population is generated randomly as feature vectors for the velocity and position of the particles; each array, with a size equal to the feature count, is filled with 0 or 1. The researchers proposed an update equation to evaluate both position and velocity. They applied a fitness sum for the particles, which are divided into two groups, and the group with the higher sum is selected. A mutation rate is applied to assure convergence. The solutions are evaluated at the new positions of the particles, and this process is repeated until the maximum generation is met [17].

In this study, the objective is feature selection that reduces the count of features and, in turn, increases the classification accuracy in multi-class SC. The initial population is generated randomly as arrays of 0 s and 1 s with a length equal to the count of important features. To decrease the dimension, half of the population is selected randomly, and only bits with a value of 1 are of concern. For each element of the initial population in the MOGW algorithm, the TF, IDF, and ICF weights are involved. The KNN and NB classifiers are applied to classify the elements based on these weights, and because the framework is supervised, the obtained classification error rate is calculated through the regression function. Thus, for each wolf (element) in the population, there exist two worth values (two objectives). The best wolf is selected based on non-dominated sorting and the sum of the crowding distances; when the wolves cannot dominate one another, or have the same level of domination, the crowding distance is applied. The equations for the movement of the wolf positions are updated based on the discrete structure, and after several generations, the wolf that dominates the other wolves (the one with the lowest domination count, or the largest crowding distance at equal domination worth) is extracted as the best feature set.

4.4 Evaluation of the MOWGOKB framework

This framework is compared to the FFSVM [16] and FS-BPSO [17] methods, which are implemented on the three datasets. The x-axis illustrates the compared models in tenfold cross-validation, whereas the y-axis shows the evaluation parameters (i.e., accuracy, precision, and recall values); Figs. 5, 6, 7, 8, 9, 10, 11, 12, 13.

Fig. 5
figure 5

The precision value of the MOWGOKB framework compared to the FFSVM and FS-BPSO methods on the PMD dataset

Fig. 6
figure 6

The accuracy value of the MOWGOKB framework compared to the FFSVM and FS-BPSO methods on the PMD dataset

Fig. 7
figure 7

The recall value of the MOWGOKB framework compared to the FFSVM and FS-BPSO methods on the PMD dataset

Fig. 8
figure 8

The precision value of the MOWGOKB framework compared to the FFSVM and FS-BPSO methods on the TS2 dataset

Fig. 9
figure 9

The accuracy value of the MOWGOKB framework compared to the FFSVM and FS-BPSO methods on the TS2 dataset

Fig. 10
figure 10

The recall value of MOWGOKB framework compared to the FFSVM and FS-BPSO methods on the TS2 dataset

Fig. 11
figure 11

The precision value of the MOWGOKB framework compared to the FFSVM and FS-BPSO methods on the TS3 dataset

Fig. 12
figure 12

The accuracy value of the MOWGOKB framework compared to the FFSVM and FS-BPSO methods on the TS3 dataset

Fig. 13
figure 13

The recall value of the MOWGOKB framework compared to the FFSVM and FS-BPSO methods on the TS3 dataset

4.4.1 The PMD dataset

The precision of this framework is compared with the two methods in Fig. 5, where its higher precision is evident. The comparison of accuracy is shown in Fig. 6, where its higher accuracy is evident; accuracy is the most outstanding index for every method. Similarly, the recall of MOWGOKB is compared with the two methods in Fig. 7, where it is observed that this framework provides higher recall; the more precise the feature extraction, the more precise the classification. The highest obtained precision, accuracy, and recall values of this framework are 95.76%, 95.21%, and 95.99%, respectively. The results improve on FS-BPSO by approximately 4% and on FFSVM by approximately 2%. The outperformance of this framework is evident when comparing the three methods. The details of this comparison in terms of error indices on the PMD dataset are in Table 17.

Table 17 Error Indices comparison is made based on the PMD dataset

4.4.2 The TS2 dataset

The precision is compared with the two methods in Fig. 8, where it is revealed that this framework is more precise than the other two; the same for accuracy is shown in Fig. 9, where its higher accuracy is evident. Accuracy is an outstanding index in any method. Similarly, the recall of MOWGOKB is compared with the two methods in Fig. 10, where it is observed that this framework provides higher recall; the more precise the feature extraction, the more precise the classification. The highest obtained precision, accuracy, and recall values of this framework are 95.72%, 95.75%, and 95.93%, respectively. The results improve on FS-BPSO by approximately 3% and on FFSVM by approximately 1%. The outperformance of this framework is evident when comparing the three methods. The details of the MOWGOKB framework error indices compared with the two methods above on the TS2 dataset are in Table 18.

Table 18 Error indices comparison is made based on the TS2 dataset

4.4.3 The TS3 dataset

The precision compared with the two methods is shown in Fig. 11, where the higher precision is evident. The comparison of accuracy is shown in Fig. 12, where its higher accuracy is evident; the accuracy (recognition percentage) is an outstanding index in any method. Similarly, the recall of MOWGOKB is compared with the two methods in Fig. 13, where it is observed that this framework provides higher recall; the more precise the feature extraction, the more precise the classification. The highest obtained precision, accuracy, and recall values of this framework are 94.98%, 94.39%, and 94.77%, respectively. The results improve on FS-BPSO by approximately 3% and on FFSVM by approximately 1%, so comparing the three methods, the outperformance of this framework is evident. The details of the MOWGOKB framework error indices compared with the two methods above on the TS3 dataset are in Table 19.

Table 19 Error indices comparison is made based on the TS3 dataset

According to Tables 17, 18, 19, for the three datasets, the error indices of the MOWGOKB framework are better than those of its counterparts. The precision, accuracy, and f-measure obtained by the three works, in tenfold cross-validation on all datasets, are compared in Table 20 and, as observed, the MOWGOKB results are better than those of its counterparts.

Table 20 The details of the MOWGOKB framework compared to the FFSVM and FS-BPSO methods in tenfold cross-validation on three datasets

5 Discussion

As observed in Figs. 14, 15, 16, the evaluation parameters of the mean performance for the compared models are represented on the x-axis, and the obtained values are represented on the y-axis. The results of different evaluation parameters (i.e., the average accuracy, precision, recall, and f-measure) of the MOWGOKB, in comparison with other models for all datasets, are shown in the same figures.

Fig. 14
figure 14

Mean performance of the MOWGOKB framework compared to the FFSVM, and FS-BPSO methods for the PMD dataset in terms of the accuracy, precision, recall, and f-measure

Fig. 15
figure 15

Mean performance of the MOWGOKB framework compared to the FFSVM and FS-BPSO methods for the TS2 dataset in terms of the accuracy, precision, recall, and f-measure

Fig. 16
figure 16

Mean performance of the MOWGOKB framework compared to the FFSVM and FS-BPSO methods for the TS3 dataset in terms of the accuracy, precision, recall, and f-measure

Overall performances on the PMD dataset are shown in Fig. 14, where the MOWGOKB framework presents average accuracy and f-measure of 94.62% and 94.85%, respectively. The FFSVM model, applying the Firefly algorithm, presents average accuracy and f-measure of 92.77% and 92.91%, respectively. The FS-BPSO method obtained average accuracy and f-measure values of 91.24% and 91.15%, respectively. The FFSVM model performs better than FS-BPSO but does not exceed this framework. This framework provides accuracy approximately 3% higher than FS-BPSO and 2% higher than FFSVM. It is revealed that this framework outperforms the compared metaheuristic algorithms on the PMD dataset.

Overall performances on the TS2 dataset are shown in Fig. 15, where the average accuracy and f-measure values of the MOWGOKB framework are 94.89% and 95.08%, respectively. The average accuracy and f-measure of the FFSVM method are 93.07% and 93.28%, respectively, while the FS-BPSO method, in which the binary PSO algorithm is applied, obtained average accuracy and f-measure values of 91.25% and 91.58%, respectively.

The FFSVM model provides better performance than FS-BPSO on the TS2 dataset but does not exceed this framework. The average f-measure obtained through the MOWGOKB framework is approximately 2% higher than that of the FFSVM method, indicating its outperformance against its counterparts on the TS2 dataset. The average accuracy is improved by approximately 4% over FS-BPSO and 2% over FFSVM.

Overall performances on the TS3 dataset are shown in Fig. 16, where the obtained average accuracy and f-measure of the MOWGOKB framework are 93.88% and 92.16%, respectively. The average accuracy and f-measure of the FFSVM method are 92.26% and 92.35%, respectively, while the FS-BPSO method obtained average accuracy and f-measure values of 90.75% and 90.57%, respectively.

The FFSVM model provides better performance than FS-BPSO on the TS3 dataset but does not exceed this framework. The average accuracy is improved by approximately 4% over FS-BPSO and 2% over FFSVM.

The advantage of this framework on the three datasets is evident in the provided figures, which reveal that the MOWGOKB framework yields better average accuracy, precision, recall, and f-measure for multi-class SC by applying an appropriate feature selection and decreased dimension. It is observed that feature selection through a discrete grey wolf algorithm yields better results than the Firefly and binary PSO algorithms applied in FFSVM and FS-BPSO, respectively. In contrast to these two methods, in this framework the TF, IDF, and ICF weights are applied in pre-processing. The findings, compared against those of the FFSVM and FS-BPSO methods, indicate that combining the discrete MOGW algorithm with the two objectives and the applied weighting mechanisms is better than the Firefly algorithm applied in FFSVM and the binary PSO algorithm applied in FS-BPSO. The findings also show that more parameters are considered here in the feature selection stage than in the other methods; hence, it is more successful than its counterparts. The error rates calculated through the two objectives determine the worth used in selecting the best wolf (element). The reason for the better results is in applying the two monitoring levels of less domination and more crowding distance.

In brief, this framework provides a better feature selection than its two counterparts do and can improve the performance of multi-class SC. It is deduced that the MOWGOKB framework provides the highest performance for all datasets, while the FFSVM model provides the second best.

6 Conclusion

People purchase products on the Internet and express their opinions about them every second. These opinions affect the financial performance of the related outlets. With the explosion of information on the Internet, it is difficult for ordinary people to make decisions about products. SC is a significant field in sentiment analysis that can assist electronic customer relations. The rapid growth in electronic text documentation makes such analysis important for information retrieval. Keywords are useful tools for searching a high volume of text documentation in a short time, and their extraction is the focus of many researchers.

A new framework called MOWGOKB is introduced for multi-class SC based on a discrete MOGW algorithm with the two objectives of decreasing the errors of the KNN and NB classifiers; the NN is applied as the final classifier. Here, user comments are first tokenized into sentences. Each sentence is decomposed into words, the words are stemmed, and the stop words are eliminated. The most important words (features) are selected through the MOGW algorithm based on the two NB and KNN error reduction objectives, and the final classification is made through the NN. For evaluation, the three PMD, TS2, and TS3 datasets are applied, and the MOWGOKB framework is compared with the FFSVM and FS-BPSO methods. The obtained results indicate that this proposed framework outperforms its counterparts on the movie dataset, with 95.76% precision, 95.21% accuracy, 95.99% recall, and 95.15% f-measure, which, compared to the other methods, represents about a 4% improvement. The obtained results also indicate that this framework outperforms its counterparts on the Twitter dataset, with 95.72% precision, 95.75% accuracy, 95.93% recall, and 95.82% f-measure, again representing about a 4% improvement over the other methods. As future work, a combination of a multi-objective algorithm with sequential pattern mining can be considered.