1 Introduction

People want to know the value of a product before purchase, whether it is a meal at a restaurant, a new movie, or a blender. In the Internet age, online reviews are essential for determining this value. The growth of e-commerce, however, presents another problem: the sheer quantity of reviews. Reading reviews requires time and effort from the consumer, and with hundreds of reviews this effort becomes a burden. While aggregate ratings alleviate the problem, they provide little detailed information. One individual may be concerned with the atmosphere of a restaurant, while another cares only about the quality of the fish it serves. Most Web sites provide only a single overall numeric rating, which gives no information about specific quality factors.

With Gist, we summarize text reviews into a few key sentences that capture the overall sentiment about the product. By presenting natural sentences, we provide easy-to-digest information that is not expressed by overall ratings. To avoid the difficult problem of generating natural language for our summary, known as text abstraction (Carenini et al. 2013; Fiszman et al. 2004; Hahn and Mani 2000), we use sentences taken from the text itself. This technique is known as text extraction (Carenini et al. 2013; Hahn and Mani 2000; Nenkova and McKeown 2012; Ku et al. 2006; Goldstein et al. 2000) and remains the dominant approach in the state of the art (Nenkova and McKeown 2012). Unsupervised learning allows Gist to summarize any product without prior knowledge or training. The result is a general system that effectively summarizes any set of reviews in any language.

Gist extends the state of the art with easy customization and rapid adaptation. Gist is a modular text summarization framework. Rigid systems inevitably fail to meet the requirements of specific domains. Rather than attempting to create a catch-all system, we design Gist to allow changes within the existing system while focusing on the core features essential for text summarization. To that end, we do not implement features that focus on specific parts of speech, but instead provide a framework that makes such modification easy.

Section 2 discusses text summarization background and related works. Section 3 provides an overview of the Gist algorithm. Section 4 describes sentiment analysis and how it relates to Gist. Section 5 describes the term frequency–inverse document frequency (TF–IDF) algorithm for relevance analysis and how it is used in Gist. Section 6 provides details on extending Gist and presents two minor attributes that we easily add to Gist. Section 7 provides a fitness function utilizing the discussed attributes. Section 8 discusses various metaheuristic search methods, leading to our choice of metaheuristic for Gist. Section 9 examines performance optimizations for Gist. Section 10 compares Gist to state-of-the-art summarization algorithms on datasets containing hundreds of documents, using Rouge. We conclude with Sect. 11.

2 Related works

Hu and Liu (2004) developed a review summarization system that focuses on rating features with association rules and sentiment analysis (Yi et al. 2003; Wilson et al. 2005; Pang and Lee 2008, 2004). Microsoft developed a similar system for restaurant reviews, presented in Nguyen et al. (2007). Li et al. (2010) used syntactic tree structures and conditional random fields to achieve similar review summarization. These systems take a structural approach to review summarization, explicitly analyzing text to discover features before summarizing those features. The result is similar to a spreadsheet of feature/value pairs. When features are properly extracted, such systems are effective; however, feature extraction is a weak point that results in odd or confusing summaries when a set of reviews is not structured as expected. These rigid systems are also difficult to adapt to specific considerations and domains. With the proposed Gist, we take a more flexible approach that avoids the difficult and error-prone task of feature extraction.

Although we focus on review summarization, the Gist framework supports general text summarization and allows easy modification and extension for any text domain. While review summarization systems are rare in the state of the art, text summarization is plentiful. Supervised learning methods for text summarization gained popularity in the early 2000s with the work of Turney (2000). While the supervised approach still finds use, unsupervised methods quickly became the state of the art when the TextRank algorithm (Mihalcea and Tarau 2004) proved versatile and effective. Recent works focus primarily on improving performance in specific domains (Rastkar et al. 2014; Chua and Asur 2013) or multi-document summarization (Carenini et al. 2013), rather than exploring new approaches to summarization.

State-of-the-art summarization is dominated by two practical tools: MEAD (Radev et al. 2004; Saggion and Poibeau 2013) and SUMMA (Saggion and Poibeau 2013; Saggion 2008, 2014). MEAD is a text summarization framework, within which numerous particular algorithms are implemented, such as MEAD* (Carenini et al. 2013; Gerani et al. 2014) for multi-document summarization and MEAD-LexRank (Gerani et al. 2014) implementing the popular LexRank algorithm (Erkan and Radev 2004). SUMMA is a set of text summarization resources built on the GATE (Maynard et al. 2002) text summarization framework.

The Gist framework plays to the requirements of modern summarization by allowing easy and rapid adaptation to any domain. We also note that many of these summarization techniques can be easily incorporated into the Gist framework and work alongside or instead of the core attributes presented in the following sections.

3 The Gist algorithm

Fig. 1 Phases of Gist

Table 1 Calculated attributes of example sentences

Gist is composed of a few disjoint components (presented in Fig. 1) that, brought together, effectively summarize text. The core of Gist is a set of attributes calculated from and assigned to each sentence in the set of text or reviews being summarized. Each attribute is a numeric value calculated from the text of a sentence. A sentence is defined as a sequence of characters followed by a stop character such as a period, exclamation mark, or question mark, then followed by whitespace. Of these attributes, the two most important are sentiment and relevance. However, any number of attributes can be assigned to sentences, and special attributes can be calculated for specific domains to utilize a priori knowledge. With these sentence attributes, we perform metaheuristic search with a fitness function that examines the attributes of N sentences and selects the N sentences that best summarize the set of text. These N sentences are then returned as the summary. The ability to add attributes with minimal effort allows one to easily extend Gist for new domains or specific considerations. For example, if trying to summarize the atmosphere of a restaurant, one can insert an attribute that gives extra fitness to sentences that mention atmosphere. For clarity, we provide a step-by-step example:

  1. Fetch: This step simply obtains the text and is domain specific. In this example, the text consists of only five sentences. For demonstration purposes, these imaginary sentences are referred to by the labels \(s_1, \ldots, s_5\). Real text can easily contain hundreds of thousands of sentences.

  2. Calculate attributes: Table 1 displays the calculated attributes for our example sentences. For this demonstration, attribute values are fabricated. The following sections explain how these attributes are calculated for real sentences.

  3. Cull: Table 2 compares each sentence's attributes to the given threshold values. The sentiment magnitude threshold is \(t_m = 0.2\), and the relevance threshold is \(t_r = 0.5\). Note that sentiment itself has no culling threshold. All attributes of a sentence must be greater than or equal to the corresponding thresholds, or the sentence is culled. Sentence \(s_4\) is removed due to low sentiment magnitude m, and \(s_5\) is removed due to low relevance r.

  4. Optimize: For this example, our goal is to find two sentences from the text that most closely match the overall sentiment of the text, defined as \(-0.2\) for this demonstration, while maximizing relevance and sentiment magnitude. This goal is formalized as the fitness function \(f(N) = 10C(N) + 3m(N) + 6r(N)\), where N is a set of sentences, \(C(N)\) is the sentiment closeness score (1) defined in the next section, \(m(N)\) is the average sentiment magnitude of N, and \(r(N)\) is the average relevance of N. Note that this goal is easily modifiable. Table 3 displays the fitness value and calculation for each sentence not culled. Although the set of sentences \((s_1, s_2)\) most closely fits the overall sentiment, the greater relevance of \(s_3\) results in a higher fitness score for \((s_2, s_3)\). A metaheuristic search algorithm explores the space of sentence combinations to find the set with the highest fitness. For this small example, we can easily see that \((s_2, s_3)\) is the best combination; a code sketch of this calculation follows Table 3.

Table 2 Comparison of attributes and culling thresholds
Table 3 Calculated fitness of sentence combinations
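To make the optimization step concrete, the following Python sketch reproduces this calculation end to end: culling with \(t_m = 0.2\) and \(t_r = 0.5\), then exhaustively scoring every remaining pair with \(f(N) = 10C(N) + 3m(N) + 6r(N)\) and \(P = -0.2\). The attribute values are fabricated for illustration (they are not the values of Table 1), but they are chosen so that the outcome matches the example: \((s_1, s_2)\) is closest in sentiment, yet \((s_2, s_3)\) wins on fitness.

```python
from itertools import combinations

# Fabricated (sentiment, magnitude, relevance) values for five example sentences.
sentences = {
    "s1": {"sentiment": -0.20, "magnitude": 0.20, "relevance": 0.55},
    "s2": {"sentiment": -0.25, "magnitude": 0.25, "relevance": 0.70},
    "s3": {"sentiment":  0.30, "magnitude": 0.30, "relevance": 0.95},
    "s4": {"sentiment":  0.10, "magnitude": 0.10, "relevance": 0.80},  # culled: magnitude < 0.2
    "s5": {"sentiment": -0.50, "magnitude": 0.50, "relevance": 0.30},  # culled: relevance < 0.5
}

T_M, T_R = 0.2, 0.5   # culling thresholds for magnitude and relevance
P = -0.2              # overall sentiment target for this demonstration

def closeness(avg_sentiment, target=P):
    """Sentiment closeness score C(N) from Eq. (1)."""
    return 1.0 / ((target - avg_sentiment) ** 2 + 1.0)

def fitness(names):
    """f(N) = 10*C(N) + 3*m(N) + 6*r(N) for a set of sentence labels."""
    attrs = [sentences[n] for n in names]
    avg = lambda key: sum(a[key] for a in attrs) / len(attrs)
    return 10 * closeness(avg("sentiment")) + 3 * avg("magnitude") + 6 * avg("relevance")

# Cull, then exhaustively score every remaining pair.
kept = [n for n, a in sentences.items() if a["magnitude"] >= T_M and a["relevance"] >= T_R]
best = max(combinations(kept, 2), key=fitness)
print(best, round(fitness(best), 3))   # ('s2', 's3')
```

At realistic scale, the exhaustive pair loop above is replaced by the metaheuristic search described in Sect. 8.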

4 Sentiment analysis

At its core, sentiment analysis (Yi et al. 2003; Wilson et al. 2005; Pang and Lee 2008, 2004) is a dictionary that maps words to sentiment polarity. Words like “great”, “good”, and “awesome” have a positive sentiment, while words like “terrible” and “bad” have a negative sentiment. Many words, such as “the”, “and”, and “a”, have no sentiment at all. The greatest difficulty lies in generating this dictionary. In this paper, we do not focus on the generation of a sentiment dictionary. For Gist, we use the expansive WordNet lexical database (Miller 1995). Beyond the core sentiment dictionary, pattern analysis provided by the TextBlob library (Loria 2014) allows for more accurate sentiment classification by recognizing linguistic constructs such as negation and, in some cases, sarcasm (González-Ibánez et al. 2011).
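As a point of reference, the snippet below shows how per-sentence polarity can be obtained with the TextBlob pattern analyzer mentioned above. This is a minimal sketch of the kind of sentiment scoring Gist builds on, not the exact code used in Gist; it assumes the textblob package and its corpora are installed.

```python
from textblob import TextBlob

text = "The chicken was wonderful. The service was not good."
for sentence in TextBlob(text).sentences:
    # polarity lies in [-1, 1]; the pattern analyzer handles constructs such as negation
    print(sentence, sentence.sentiment.polarity)
```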

It is worth discussing the disadvantages of sentiment analysis, both in the current state of the art and in general. First, we make no claims to the optimality of our sentiment analysis. Certain forms of speech, such as sarcasm and unusual use of words, can result in misclassification of sentiment. Even if our sentiment analysis were perfect, the issue of subjectivity would remain. To perfectly classify sentiment, the analysis must tailor its behavior to an individual. Language is a complex construct, and everyone interprets words and sentences differently. A sentence that is very positive to one individual could be negative to another, based on culture or personal experience. Even human judges perform poorly at sarcasm classification (González-Ibánez et al. 2011). As it is unreasonable to tailor our sentiment analysis to every individual, we must settle for an imperfect system. However, since sentiment analysis in Gist acts only as heuristic guidance, Gist can function well in the intended environment.

4.1 Sentiment attribute in Gist

Sentiment analysis plays a key role in Gist. Reviews are about feeling. One review may have great things to say about a product, which is a positive sentiment. Another may hate the product, which is a negative sentiment. Our first step in summarizing a set of reviews is determining the overall sentiment of the reviews. Note that the sentiment analysis procedure described in Sect. 4 can be applied to the entire set of text to obtain the overall sentiment. Next, we obtain the sentiment of every sentence. By examining the average sentiment of a set of N sentences, we can determine how closely the N sentences match the overall sentiment of the product. As such, a product with generally positive reviews will have a positive summary, and vice versa. This is formalized in Eq. (1):

$$\begin{aligned} C(N)=\frac{1}{\left( P-S_\mathrm{avg} (N)\right) ^{2}+1}, \end{aligned}$$
(1)

where P is the overall product sentiment and \(S_\mathrm{avg}(N)\) is the average sentiment of a set of N sentences. With this function, \(C=1\) when average sentiment matches P. As average sentiment diverges from P, C approaches 0.

We also examine sentiment magnitude, the absolute value of sentiment. This heuristic leads us to sentences that are well opinionated. When generating a summary, we want sentences that tell us something about the product, which is to say, we want opinions. A sentence such as “I ordered the chicken” tells us less than “The chicken was wonderful”.
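A minimal sketch of these two sentiment-based attributes, assuming per-sentence polarity scores in [-1, 1] are already available:

```python
def sentiment_closeness(sentence_sentiments, overall_sentiment):
    """C(N) from Eq. (1): 1 when the average sentiment of the selected
    sentences matches the overall product sentiment P, approaching 0
    as the average diverges from P."""
    avg = sum(sentence_sentiments) / len(sentence_sentiments)
    return 1.0 / ((overall_sentiment - avg) ** 2 + 1.0)

def sentiment_magnitude(polarity):
    """Magnitude attribute: absolute polarity, favoring opinionated sentences."""
    return abs(polarity)

# "The chicken was wonderful" (strongly positive) has a higher magnitude
# than the neutral "I ordered the chicken".
print(sentiment_magnitude(0.8), sentiment_magnitude(0.0))
print(sentiment_closeness([0.6, -0.9], overall_sentiment=-0.2))
```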

5 TF–IDF

The term frequency (TF) component of TF–IDF dates back to Luhn (1957). The concept is simple: the more a word is used in a set of text, the more relevant it is to that text. This is formalized in Eq. (2):

$$\begin{aligned} \mathrm{TF}(w) = \frac{\text{count of } w \text{ in } T}{\text{count of terms in } T}, \end{aligned}$$
(2)

where w is a word (or term) and T is a set of text. Term frequency alone would give disproportionate weight to commonly used words such as “and”, “the”, “a”. Even though these words show up frequently in any set of text, they are not relevant because they are simply necessary grammar. The fact that they are omnipresent allows us to solve this problem with inverse document frequency (IDF), first introduced in Sparck Jones (1972). Like TF, IDF is conceptually simple. The more documents a term appears in, the less relevant it is for any particular document. This is formalized in Eq. (3):

$$\begin{aligned} \mathrm{IDF}(w) = \ln \left( \frac{\text{count of documents}}{\text{count of documents containing } w} \right). \end{aligned}$$
(3)

If a term appears in every document, its IDF value is 0, while a term that appears in only one document can easily have a value in the double digits. Combining these two concepts, we obtain the TF–IDF algorithm:

$$\begin{aligned} \hbox {TF}-\hbox {IDF}(w) = \hbox {TF}(w) \times \hbox {IDF}(w). \end{aligned}$$
(4)

Conceptually simple, with roots reaching back to the 1950s and 1970s, this algorithm has proven effective and durable, with numerous examples of modern usage (Chum et al. 2008; Zhang et al. 2011; Arandjelović and Zisserman 2012).

5.1 Relevance attribute in Gist

If we only examine sentiment when generating a summary, an issue arises. For the “McDonalds” restaurant, the summary could include “Star Wars is a really great movie”. This sentence is completely irrelevant to the restaurant. To eliminate such sentences, we use the TF–IDF algorithm given in Eq. (4). For our document corpus, we use the well-tested Reuters-21578, Distribution 1.0 (Lewis 1997; Debole and Sebastiani 2005), containing 21578 documents.

The core of TF–IDF is a dictionary of word values. To obtain this dictionary, we must calculate the TF value of every word in the set of reviews we are summarizing (Eq. 2) and the IDF values for our corpus of documents (Eq. 3). Next, using Eq. (4), we obtain the relevance R for sentence S with m words, as in Eq. (5):

$$\begin{aligned} R(S) = \frac{\sum _{w \in S} \text{TF-IDF}(w)}{m}. \end{aligned}$$
(5)

In short, R(S) is the average TF–IDF score of all words, w, in sentence S. Normalizing by the number of words, m, allows us to obtain a relevance value independent of the size of the sentence, thereby providing more effective relevance comparison. This attribute allows us to penalize outlier sentences and reward sentences that closely match the average language used in the set of reviews.
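A sketch of the relevance attribute, assuming IDF values are precomputed from a background corpus such as Reuters-21578. The whitespace tokenizer and the default IDF of 0 for unseen words are illustrative choices, not necessarily those made in Gist.

```python
import math
from collections import Counter

def tf_table(review_text):
    """Eq. (2): term frequency over the whole set of reviews being summarized."""
    words = review_text.lower().split()
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def idf_table(corpus_documents):
    """Eq. (3): inverse document frequency from a background corpus (list of strings)."""
    n_docs = len(corpus_documents)
    doc_freq = Counter()
    for doc in corpus_documents:
        doc_freq.update(set(doc.lower().split()))
    return {w: math.log(n_docs / df) for w, df in doc_freq.items()}

def relevance(sentence, tf, idf, default_idf=0.0):
    """Eq. (5): average TF-IDF score of the words in a sentence."""
    words = sentence.lower().split()
    scores = [tf.get(w, 0.0) * idf.get(w, default_idf) for w in words]
    return sum(scores) / len(words) if words else 0.0

# Tiny illustration: score one review sentence against the review set and a toy corpus.
reviews = "the fish was great the fish was fresh and the service was great"
corpus = ["stocks fell sharply today", "election results were announced",
          "a new restaurant opened downtown serving fresh fish"]
tf, idf = tf_table(reviews), idf_table(corpus)
print(relevance("the fish was great", tf, idf))
```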

6 Extensibility

Review summarization is only one of many possible applications of Gist. The ability to easily add attributes allows Gist to extend to other domains without reinventing the core system. General text summarization is already possible with the core Gist attributes.

For example, one can adapt Gist to extract sentences mentioning recent news from social media. Each set of text can be taken from different social media posts, or from many posts by each person. An attribute that gives greater weight to sentences containing phrases related to recent news, together with a filter for these phrases, extends Gist toward the domain of news summarization. Combined with the core Gist attributes, this system could find sentences that capture people’s views of recent news. If one is interested in only positive or negative views, a corresponding sentiment filter can be added.

We note that, while the formulation in Fig. 1 and Sect. 7 specifies a linear combination of attributes, nonlinear attributes are also possible: attributes can be calculated from a set of sentences during the optimization phase, at the expense of performance. The sentiment attribute in Sect. 4.1 is an example of a nonlinear Gist attribute.
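The exact extension interface of Gist is not reproduced here; the hypothetical sketch below only illustrates the idea of registering an extra per-sentence attribute and its weight, for example one that boosts sentences mentioning a restaurant's atmosphere. All names (register_attribute, topic_boost, and so on) are invented for illustration.

```python
# Hypothetical extension sketch -- names and interface are illustrative only.
ATTRIBUTES = {}   # attribute name -> (per-sentence scoring function, weight)

def register_attribute(name, score_fn, weight):
    """Register a new per-sentence attribute; the fitness function averages it over N."""
    ATTRIBUTES[name] = (score_fn, weight)

def topic_boost(keywords):
    """Return a scorer that gives 1.0 to sentences mentioning any keyword, else 0.0."""
    keyword_set = {k.lower() for k in keywords}
    def score(sentence):
        return 1.0 if set(sentence.lower().split()) & keyword_set else 0.0
    return score

# Give extra fitness to sentences about atmosphere, alongside the core attributes.
register_attribute("atmosphere", topic_boost(["atmosphere", "ambiance", "decor"]), weight=2.0)
```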

It is also easy to improve Gist’s effectiveness as a general text or review summarization system. As an unsupervised learning system, the effectiveness of Gist summarization is ultimately judged by user opinion. Although our resources for thoroughly testing combinations of attributes are limited, when new attributes are added, user opinion can be gathered to determine whether they improve summarization. This step can be repeated multiple times to optimize text summarization. Giving greater weight to sentences with certain parts of speech is an example of an attribute that may improve text summarization.

6.1 Minor attributes in Gist

Beyond the primary attributes, a number of minor attributes provide additional direction for Gist. For one, Gist grants additional fitness to sentences that mention the name of the product, or to a lesser extent, part of the name. This reduces the chance of generating a summary with sentences that are out of context. The sentence “Star Wars is a really great movie” is given greater weight than “That is a really great movie”.

By examining the length of a sentence, we provide heuristic guidance toward sentences that fit the concept of a summary. Short sentences are penalized for not providing enough information, while long sentences are penalized for being too verbose. The optimal sentence length is empirically determined to be 15 words, a figure reached by calculating the average sentence length of hundreds of reviews deemed very helpful by the Yelp community.
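A sketch of both minor attributes follows. The title attribute rewards full and partial name mentions as described above; the length attribute centers on the empirically chosen 15-word target, but its exact penalty shape is not specified in the text, so the linear falloff used here is an assumption.

```python
def title_mention(sentence, product_name):
    """Reward sentences that mention the product name, or part of it."""
    text = sentence.lower()
    if product_name.lower() in text:
        return 1.0                          # full name mentioned
    tokens = set(text.split())
    if any(part in tokens for part in product_name.lower().split()):
        return 0.5                          # partial mention, lesser reward
    return 0.0

def length_score(sentence, target=15, falloff=30):
    """Penalize sentences far from the 15-word target (linear falloff is assumed)."""
    n_words = len(sentence.split())
    return max(0.0, 1.0 - abs(n_words - target) / falloff)

print(title_mention("Star Wars is a really great movie", "Star Wars"))  # 1.0
print(title_mention("That is a really great movie", "Star Wars"))       # 0.0
```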

7 Fitness function

Each attribute presented in Sects. 4–6 provides a value correlating with the effectiveness of a sentence or set of sentences for summarization. These attributes consider sentiment, relevance, length, and title (name) of the document or product. Further attributes can be easily added to consider other aspects of a good summary. Because each attribute individually scores the effectiveness of a sentence or set of sentences for summarization, combining these attributes with weighted summation provides an effective fitness function.

This fitness function must first combine individual sentence attributes into a combined value for each attribute. Some attributes are combined as an average of individual sentence attributes; others are a nonlinear combination. A weighted summation of these attribute combination values then provides the final fitness value. Our attribute weights are given in Table 4. Note that each attribute is defined in Sects. 4–6. This weighted sum benefits from reliable value ranges. Apart from relevance, all attributes fall within a predictable range. To make relevance values more predictable, we normalize them to the range 0 to 1. Algorithm 1 presents this fitness function for a set of N sentences.

Table 4 Attribute weights
Algorithm 1 Fitness function for a set of N sentences

Here, W(A) is the weight of attribute A, W(C) is the weight of the sentiment closeness attribute, and \(A_\mathrm{avg}(N)\) is the average value of attribute A over the sentences in N. Of our attributes, only the sentiment closeness function C(N) of Eq. (1) is a nonlinear combination. As described in Sect. 6, Gist can be extended with additional nonlinear attributes.
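A Python sketch consistent with this description of Algorithm 1: each linear attribute is averaged over the N selected sentences and weighted, while the nonlinear sentiment closeness term C(N) is computed from the set as a whole. The attribute names and dictionary layout are illustrative.

```python
def fitness(selected, weights, overall_sentiment):
    """Weighted sum of averaged attributes plus the nonlinear closeness term.

    selected : list of per-sentence attribute dicts, e.g.
               {"sentiment": -0.3, "magnitude": 0.3, "relevance": 0.7,
                "title": 1.0, "length": 0.9}
    weights  : dict of attribute weights, with key "closeness" for W(C)
    """
    n = len(selected)
    total = 0.0
    for attr, w in weights.items():
        if attr == "closeness":
            continue                                     # nonlinear term handled below
        total += w * sum(s[attr] for s in selected) / n  # W(A) * A_avg(N)

    avg_sentiment = sum(s["sentiment"] for s in selected) / n
    closeness = 1.0 / ((overall_sentiment - avg_sentiment) ** 2 + 1.0)  # Eq. (1)
    return total + weights["closeness"] * closeness
```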

Since attributes are pre-calculated for every sentence, the fitness function is very fast, requiring only an addition operation for every attribute of every sentence and a multiplication operation for every attribute, with the exception of C(N), which scales linearly with the size of N. Fitness function scaling is formalized as \(O(ak)\), where a is the number of attributes and k is the size of N.

8 Metaheuristic search

The ability to traverse a general fitness landscape makes metaheuristics a powerful search tool. Each metaheuristic uses the same fitness function, defined in Sect. 7. This fitness function takes a set of sentences and returns a single number that defines the quality of the given sentences with regard to summarizing the text. In Gist, this fitness value is determined by a combination of sentence attributes such as sentiment, TF–IDF relevance, and other values defined by a user. The generality of metaheuristic search lets us change the definition of a good sentence by adjusting the fitness function, without modifying the process that discovers the best sentences for summarization.

Many metaheuristics exist to perform search; we explore a popular subset. The classic genetic algorithm (GA) remains an effective search method (Deb et al. 2002; Yang and Honavar 1998; Uğuz 2011). More recently, metaheuristic methods rooted in natural and physical processes have grown in popularity. Particle swarm optimization (PSO) (Kennedy and Eberhart 1995) is now a staple of metaheuristic methods. The more recently developed gravitational search algorithm (GSA) (Rashedi et al. 2009) is functionally similar to PSO, but is inspired by physics rather than nature. The Gist fitness function works with most metaheuristics. As such, the choice of metaheuristic affects only the efficiency of the search for the best set of sentences; it is an interchangeable component that does not affect the core process of Gist.

8.1 Genetic algorithm

The GA (Deb et al. 2002; Yang and Honavar 1998; Uğuz 2011), inspired by evolution in nature, is one of the oldest metaheuristics. Despite its age, this algorithm is one of the most popular metaheuristics in use today, a testament to its effectiveness. One advantage of the GA is extensibility. The use of interchangeable crossover, selection, and mutation functions allows one to tailor the GA to a particular problem.

Selecting candidate solutions, or chromosomes, based on fitness creates pressure to improve fitness and forms the core of the exploitation component of this metaheuristic. Selected chromosomes are then combined to explore potential solutions in the portion of the problem space that high-fitness solutions occupy, adjusting potential solutions toward higher fitness and exploring new space. Mutation maintains diversity in the chromosome population by randomly adjusting chromosomes independent of fitness.

In this paper, we do not attempt to improve upon the GA literature. Instead, we make use of the staples already developed. One-point crossover takes two parent chromosomes and generates two child chromosomes by taking one part from each parent, according to a randomly selected crossover point, as displayed in Fig. 2.

Fig. 2 One-point crossover

Roulette selection randomly selects parents for crossover, with a probability of selection proportional to the fitness of the chromosome. Bit string mutation examines every bit of every child chromosome and, with probability p, flips the bit from 0 to 1, or 1 to 0. Conversely, with probability \((1 - p)\) the bit remains unchanged.
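For concreteness, minimal implementations of these three standard GA operators are sketched below; population management and fitness evaluation are omitted.

```python
import random

def one_point_crossover(parent_a, parent_b):
    """Split both bit-string parents at a random point and swap tails (Fig. 2)."""
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def roulette_select(population, fitnesses):
    """Pick a parent with probability proportional to its fitness."""
    return random.choices(population, weights=fitnesses, k=1)[0]

def bit_mutation(chromosome, p=0.02):
    """Flip each bit independently with probability p."""
    return [bit ^ 1 if random.random() < p else bit for bit in chromosome]
```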

8.2 Particle swarm optimization

Like GAs, PSO (Kennedy and Eberhart 1995) is inspired by nature, but is less dependent on controlled randomness. By studying the swarming behavior of birds and fish, Kennedy and Eberhart (1995) developed this algorithm to mimic the social behavior of animals that can scatter, change direction suddenly, and regroup to discover an optimal path.

PSO initializes a population of particles at random positions in a problem space. Random acceleration vectors drive these particles to explore new solutions. Acceleration toward high-fitness particles, and the best known solution, allows for the exploitation of known solutions in an effort to find small adjustments that increase fitness. Figure 3 shows how random acceleration and acceleration toward high-fitness areas may combine to give the movement of low-fitness particles through a problem space.

Unlike the GA, PSO's small adjustments of the previous population allow for fine-grained exploration of continuous problem spaces. Even without a continuous problem space, PSO's behavior causes low-fitness particles to quickly swarm toward high-fitness solutions, allowing rapid adjustment. However, this high exploitation comes at the cost of potentially converging to local optima.
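The canonical velocity and position update for a single particle is sketched below, using the \(\omega\), \(\phi_p\), and \(\phi_g\) parameters later listed in Sect. 9.2; this is the standard PSO update, not Gist-specific code.

```python
import random

def pso_step(position, velocity, personal_best, global_best,
             omega=0.5, phi_p=0.5, phi_g=0.5):
    """One velocity/position update for a single particle (per dimension)."""
    new_velocity, new_position = [], []
    for x, v, p, g in zip(position, velocity, personal_best, global_best):
        rp, rg = random.random(), random.random()
        v_next = omega * v + phi_p * rp * (p - x) + phi_g * rg * (g - x)
        new_velocity.append(v_next)
        new_position.append(x + v_next)
    return new_position, new_velocity
```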

8.3 Gravitational search algorithm

GSA (Rashedi et al. 2009) has many similarities to PSO. Like PSO, it makes iterative adjustments to an initially random population, as opposed to generating completely new solutions from iteration to iteration as the GA does. Also like PSO, GSA draws low-fitness solutions toward high-fitness solutions as a form of exploitation. Unlike PSO, GSA is inspired by the gravitational motion of bodies in free space, as dictated by Newton's laws.

Fig. 3 Particles drawn toward high fitness (https://commons.wikimedia.org/wiki/File:Contour2D.svg#/media/File:Contour2D.svg)

In GSA, every potential solution has a mass proportional to its fitness. As in natural physics, bodies attract one another; higher-mass bodies attract more strongly and are less affected by the attraction of other bodies. This mechanism alone would quickly cause all bodies to converge toward the highest-fitness bodies. Unless the global optimum lies along the path of one of these bodies, this would result in premature convergence to a local optimum. To encourage exploration of the problem space, the summation of force on a body and the velocity update for a body are randomized. Specifically, the total force on a body in one dimension is given as:

$$\begin{aligned} F_i^d(t) = \sum _{j=1,\, j \ne i}^{N} \mathrm{rand}_j\, F_{ij}^d(t), \end{aligned}$$
(6)

where \(F_i^d(t)\) is the total force on body i in dimension d at time step t, \(\mathrm{rand}_j\) is a random number between 0 and 1, \(F_{ij}^d(t)\) is the force of body j on body i in dimension d at time step t, and N is the number of bodies. As such, the force that each body applies to every other body in each dimension is randomized. Figure 4 depicts two bodies of equal mass, equidistant in the x and y dimensions; under Newton's laws the force in each dimension would be equal, whereas in GSA it is randomized.

The velocity update is likewise randomized for each body by taking a random fraction of the velocity in each dimension and adding its acceleration in that dimension, to obtain the velocity for the next time step. This is formalized as:

$$\begin{aligned} v_i^d(t+1) = \mathrm{rand}_i\, v_i^d(t) + a_i^d(t), \end{aligned}$$
(7)

where \(v_i^d(t)\) is the velocity of body i in dimension d at time step t, \(\mathrm{rand}_i\) is a random number between 0 and 1, and \(a_i^d(t)\) is the acceleration of body i in dimension d at time step t. Presented here are the equations that most differentiate GSA from standard Newtonian motion. For a more in-depth presentation of the GSA algorithm, see Rashedi et al. (2009).
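A sketch of the two randomized updates of Eqs. (6) and (7), assuming the pairwise forces \(F_{ij}\) have already been computed from masses, distances, and the gravitational constant G; mass assignment and the decay of G are omitted.

```python
import random

def total_force(i, pairwise_forces):
    """Eq. (6): randomized sum of the forces all other bodies exert on body i.

    pairwise_forces maps body index j to the force vector F_ij (one value per dimension).
    """
    n_dims = len(next(iter(pairwise_forces.values())))
    force = [0.0] * n_dims
    for j, f_ij in pairwise_forces.items():
        if j == i:
            continue
        for d in range(n_dims):
            force[d] += random.random() * f_ij[d]   # rand_j drawn in [0, 1] per dimension
    return force

def update_velocity(velocity, acceleration):
    """Eq. (7): keep a random fraction of the old velocity, then add the acceleration."""
    return [random.random() * v + a for v, a in zip(velocity, acceleration)]
```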

Fig. 4 Movement of bodies in GSA

9 Performance enhancements

The system presented thus far relies on a metaheuristic without a guarantee of finding the optimal solution in a potentially large problem space. For very large problems, if most sentences score poorly on the fitness function, the summary may be poor. To more consistently discover the optimal set of sentences, we reduce the problem space by culling poor sentences, thereby optimizing the metaheuristic search.

9.1 Culling

Every sentence adds to the search space of the metaheuristic search process. However, many sentences can easily be deemed unsuitable due to the value of a single attribute. Completely neutral, irrelevant, or extremely long sentences can immediately be discarded from the search space, thereby saving time and increasing the chance of finding the optimal set of sentences.

Table 5 Statistical significance of metaheuristic comparison (ANOVA)
Table 6 Metaheuristic comparison

To this end, we introduce a technique we call intermediate culling. As each attribute A of sentence S, given as A(S), is calculated, it is compared to a threshold. If the attribute value does not pass the threshold, the sentence is immediately discarded. No additional attributes are calculated for the discarded sentence. Otherwise, the next attribute is calculated, and this process repeats. This is presented in Algorithm 2.

Algorithm 2 Intermediate culling

The culling algorithm requires only a single pass over the sentence dataset. In the worst case, when no sentences are culled, performance scaling is \(O(sa)\), where s is the number of sentences and a is the number of attributes. In practice, average performance is better: once an attribute fails its threshold, no further attributes are calculated for that sentence, so fewer than a attributes are computed for some sentences. Average complexity depends on the culling thresholds and the nature of the sentence dataset.
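A sketch consistent with the description of Algorithm 2: attributes are computed one at a time, and a sentence is discarded as soon as any thresholded attribute fails, so later attributes are never calculated for culled sentences. The attribute ordering and data layout are illustrative.

```python
def cull(sentences, attribute_fns, thresholds):
    """Intermediate culling: keep only sentences whose every thresholded
    attribute meets its threshold, computing attributes lazily.

    attribute_fns : ordered dict of name -> function(sentence) -> float
    thresholds    : dict of name -> minimum acceptable value (missing = no threshold)
    """
    kept = []
    for sentence in sentences:
        attrs = {}
        passed = True
        for name, fn in attribute_fns.items():
            value = fn(sentence)
            attrs[name] = value
            if name in thresholds and value < thresholds[name]:
                passed = False
                break                  # skip remaining attributes for this sentence
        if passed:
            kept.append((sentence, attrs))
    return kept
```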

9.2 Metaheuristic comparison

Table 7 Summarization algorithm comparison
Table 8 Movie review summaries
Table 9 Pony Express summaries
Table 10 Keepsake summaries
Table 11 Clever Surveys summaries

In this section, we present our performance results for each of the following metaheuristics: the canonical genetic algorithm (GA), particle swarm optimization (PSO), and gravitational search (GSA). Each algorithm runs for 1000 iterations, with the best fitness recorded. Since these metaheuristics are stochastic, each test is run 1000 times and the results are averaged. Standard deviation (SD) is also presented to determine consistency of performance. This test is performed on three different sets of restaurant reviews. One-way analysis of variance (ANOVA) (Iversen and Norpoth 1987; Arcuri and Briand 2014) is also presented to ensure statistical significance of these results.

Each algorithm has a population size of 20. The GA has a mutation rate of 0.02, a crossover rate of 0.7, a standard roulette selection function, and uses one-point crossover. PSO has an \(\omega\) value of 0.5, a \(\phi_p\) value of 0.5, and a \(\phi_g\) value of 0.5, where \(\omega\) is a scaling factor for particle velocity, \(\phi_p\) is the pull of a particle's best known position, and \(\phi_g\) is the pull of the global best known position. GSA has an initial G value of 1 and a G reduction rate \(\beta\) of 0.5. The GA uses a binary chromosome of size \(\lceil \log_2 S \rceil \times N\), where S is the number of sentences after culling and N is the number of sentences to include in the summary. To obtain sentences from this binary chromosome, a block of bits is decoded into an integer index N times. If an index is out of bounds for the array of sentences, the chromosome is immediately given a very low fitness. PSO and GSA use a solution vector of N real numbers, which become indices by taking the floor of each real number.
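A sketch of this decoding is given below. The block size and out-of-bounds handling follow the description above; the use of None to signal an invalid chromosome (which the caller then maps to a very low fitness) is an illustrative choice.

```python
import math

def decode_chromosome(bits, num_sentences, summary_size):
    """Decode a binary GA chromosome into summary sentence indices.

    The chromosome holds summary_size blocks of ceil(log2(num_sentences)) bits;
    each block is read as an integer index. Returns None for out-of-bounds
    indices, which the caller maps to a very low fitness.
    """
    block = math.ceil(math.log2(num_sentences))
    indices = []
    for i in range(summary_size):
        chunk = bits[i * block:(i + 1) * block]
        index = int("".join(str(b) for b in chunk), 2)
        if index >= num_sentences:
            return None                # invalid chromosome -> penalized fitness
        indices.append(index)
    return indices

def decode_real_vector(reals):
    """PSO/GSA solutions: take the floor of each real number to get indices."""
    return [math.floor(x) for x in reals]
```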

The low p values in Table 5 show that the differences between GA, PSO, and GSA on each set of reviews are statistically significant. As such, we can confidently state that the variation between performance means in Table 6 is due to the algorithms themselves, and not random variance. We note that, although ANOVA assumes normality of data and equality of variance, the central limit theorem implies that, for large sample sizes, parametric tests like ANOVA are robust even if the data violate these assumptions (Arcuri and Briand 2014; Rice 2006; Sawilowsky and Blair 1992).

As shown in Table 6, the GA achieves the highest fitness for all sets of reviews, with the highest consistency, as indicated by the lowest standard deviation (SD). However, PSO and GSA follow closely. While PSO and GSA excel at exploring continuous problem spaces, the discrete nature of this problem voids that advantage. The GA, on the other hand, is designed for discrete problem spaces such as the Gist problem tested here. The GA also has higher exploration, giving it an advantage in this large problem space. PSO and GSA excel in problem spaces with smooth gradients, but selecting sentences results in a problem space resembling a complex step function with many basins of attraction. These basins easily trap PSO and GSA, while the mutation operator of the GA allows it to escape and continue the search for the optimal solution. These factors favor the time-tested performance of the GA, making it most effective at determining the best sentences to form a summary from a set of text.

This analysis leads us to implement the GA as the metaheuristic search component of Gist. However, Gist is agnostic to the particular metaheuristic: as long as a metaheuristic is compatible with the fitness function given in Algorithm 1, it will function with Gist. Given the vast number of metaheuristics developed, we do not perform an exhaustive comparison. Instead, we use this analysis as guidance when selecting a metaheuristic for Gist. In the classic trade-off of exploration versus exploitation, the GA has a greater rate of exploration than PSO and GSA, which proves advantageous in the large problem space Gist must search. The GA also searches a discrete problem space, while PSO and GSA excel at continuous problem spaces. As such, we conclude that high-exploration, discrete metaheuristics function best with Gist.

10 Benchmark and comparison

Rouge (Lin 2004), a tool for automatically evaluating text summarization, allows us to efficiently compare the effectiveness of Gist with state-of-the-art text summarization algorithms on large text datasets. Rouge-1, a 1-gram method applied here with stop words and without synonyms, compares the similarity of automatically generated summaries with expert, gold-standard summaries. Rouge-1 is shown to match human evaluations with high accuracy (Lin 2004).

Three datasets are benchmarked: Opinosis (Ganesan et al. 2010), containing 51 sets of reviews about hotels, cars, and various electronics, with professional summaries for each; the single-document DUC 2004 dataset (http://www-nlpir.nist.gov/projects/duc/data/2004_data.html), containing 500 news articles from the AP and New York Times newswire, with professional summaries for each; and cmp-lg (http://www-nlpir.nist.gov/related_projects/tipster_summac/cmp_lg.html), containing 183 scientific papers from Association for Computational Linguistics conferences, with each corresponding abstract used as a professional summary.

Gist is compared to two state-of-the-art text summarization algorithms: TextRank (Mihalcea and Tarau 2004), a graph-based ranking model inspired by Google’s PageRank, and LexRank (Erkan and Radev 2004), a method for computing sentence importance based on eigenvector centrality with an intra-sentence cosine similarity matrix. We omit text summarization ensembles like MEAD from our comparison to focus on the performance of individual algorithms. Gist is fully capable of working with text summarization algorithms like LexRank, in an ensemble.

Table 7 presents average recall, precision, and F-score for each dataset and algorithm. In the context of text summarization, these statistics can be interpreted as follows:

  • Recall: How much necessary information does the summary contain?

  • Precision: How little unnecessary information does the summary contain?

  • F-score: How good is the summary? F-score is a combination of recall and precision.
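To make these definitions concrete, a simplified Rouge-1 computation is sketched below; the actual Rouge toolkit (Lin 2004) adds stemming options, stop-word handling, multiple references, and other refinements not shown here.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Simplified Rouge-1: unigram overlap between a candidate summary and a gold one."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())          # clipped unigram matches
    recall = overlap / sum(ref.values())          # necessary information covered
    precision = overlap / sum(cand.values())      # unnecessary information avoided
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return recall, precision, f_score
```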

Gist outperforms the comparison algorithms on all but the cmp-lg dataset, as indicated by the higher F-scores. On the cmp-lg dataset, Gist shows comparable performance with a fraction of the runtime. Although Gist shows lower average recall, it consistently achieves high precision, leading to an overall higher F-score. These results indicate that Gist generates more concise summaries while maintaining most of the necessary information extracted by state-of-the-art text summarization.

The modular nature of Gist allows for easy extension and improvement. A process of adding or modifying attributes and then testing performance on one or more datasets can be followed to improve the summarization ability of Gist, with Rouge serving as the test component of this modify-and-test loop. Drawing from the results presented in Table 7, adding attributes that value sentences based on text centrality, similar to LexRank, may improve the recall of Gist, leading to even higher performance.

11 Conclusion

Gist is a powerful, modular system proposed for extractive summarization of text and reviews. Through metaheuristic search over sentences with Gist-calculated attribute values, Gist rapidly parses large amounts of text for the general opinion contained therein, an ability demonstrated with the empirical evidence presented above. By entering the text of reviews for a product, the opinion is summarized in easily understandable human language that captures details not present in numeric ratings.

Sentiment analysis and TF–IDF relevance form the core attributes of each sentence in Gist. Additional attributes that provide heuristic guidance toward effective summary sentences are also presented in Sect. 6.1. By adding an attribute to sentences and assigning it a weight, Gist automatically integrates the new attribute into its summarization. This powerful mechanism allows for easy extension into new domains or improved summarization through a priori knowledge.

Our performance analysis shows that Gist scales linearly with the size of the text and the number of attributes, making it efficient for large-scale text parsing and data mining. Developers can easily adapt Gist to extract specific concepts from text while taking advantage of this fast performance. Such adaptation allows data miners to quickly adjust to trends or implement new ideas and improvements.

Our algorithm comparison in Sect. 10 demonstrates Gist's ability to effectively summarize both reviews and articles. Finally, we present real summaries generated by Gist, TextRank, and LexRank. Summaries of several popular movies are presented in Table 8; for each movie, user reviews are obtained from metacritic.com and each summarization algorithm summarizes the combined review text. Tables 9, 10, and 11 each present a small selection of text relating to a topic or product, along with a summary from each algorithm generated from that text.