1 Introduction

Business organizations now use the big data available on the World Wide Web (WWW) to gather competitor intelligence (CI) (Wright et al. 2002), thereby augmenting the traditional channels used for this purpose (Groom and David 2001). Owing to the heterogeneous nature and volume of such data, searching for the necessary information (Browne et al. 2017) and representing it appropriately (O’Reilly 1983) are extremely important for enabling managers to use and integrate it effectively in their decision-making process (Kowalczyk 2014). Existing CI systems, such as CIntell (Donohue and Murphy 2016), do not address the need for generating high-quality, concise information from gathered data in the form of reports. Automatic text summarization (ATS) (Mani and Maybury 1999; Radev et al. 2002) is a technique that can be applied to this huge volume of diverse information on competitors to present it in concise form, for quick and easy reference by decision makers. This reduces information overload and enhances the usability (Okike and Fernandes 2012; Xu et al. 2011) of this data. The primary drivers of this research are recent research on the application of ATS and text clustering to CI corpora, the relevance and requirements of such applications in the context of information systems (IS) theories (Miller 1956; Cohen and Levinthal 1990; Browne et al. 2017; O’Reilly 1983), the availability of state-of-the-art ATS and clustering technologies, and, most importantly, the increasing interest of business decision makers in such applications.

Essentially, this study aims at designing an integrated system for segregating diverse text documents related to the competitor(s) of business organizations (gathered from the WWW) into topical clusters, providing machine-generated topic-wise summaries suitable for supporting CI-specific decision making, and empirically demonstrating that such a system would be practicable and useful. The significant contributions of this research are as follows:

  • To design and implement an integrated system for CI-specific information analysis by managers using ATS and text clustering and assess feasibility in terms of summary quality and usability.

  • To evaluate CI-specific extractive summaries (Mani and Maybury 1999) generated using ATS techniques based on global optimization by applying standard metrics, namely, recall/precision (Lin 2004).

  • To show that extractive summaries generated using global optimization techniques are useful from CI perspective by using the feedback of the practicing managers and subsequently linking the findings to relevant IS theories.

A majority of organizations cannot utilize the CI gathered from the WWW in their strategic planning owing to information overload (Gilad 2015; Gilad and Fuld 2016) and the lack of proper content, quality and form (O’Reilly 1983) in the textual data, which affects usability. To address these usability aspects, the proposed CI analysis system integrates clustering prior to the summarization phase, because generating a monolithic summary from the entire CI corpus is not appropriate for analysis owing to the corpus’s heterogeneity, the possibility of greater information loss and the limitation of human memory (Miller 1956). The segregation of topics using clustering for CI-based decision making is also justified from the Analytic Hierarchy Process (AHP) perspective, as noted in Wang and Forgionne (2006). Although the combination of topic identification by clustering followed by summarization has been studied by some researchers (Radev 2004) in a different context, it needs to be re-evaluated for the proposed CI analysis system. Therefore, an appropriate text clustering technique (Chakraborti and Dey 2016) is chosen based on relevance and used prior to the summarization step for the CI corpus, with a view to obtaining good-quality summaries with homogeneous information content.

Extractive summarization (Mani and Maybury 1999) is chosen as representation to ensure the reliability of information content which is critical for CI analysis. Global optimization-based summarization is used because this is found to generate qualitatively better abstractive summaries (refer Sect. 2.4). To validate that these techniques are effective for extractive summarization, first, a set of global optimization techniques for ATS are chosen based on state-of-the-art literature. This is followed by evaluation of system summaries against the golden reference summaries created by human volunteers, and against summaries generated by the greedy approach for benchmarking purpose.

Since usability of any system can best be judged by its users, inputs from practicing managers are also obtained, using system generated summaries based on various metrics, namely, “content”, “form”, “quality” (O’Reilly 1983), “information need”, “information use” (Browne et al. 2017) etc. The findings from this analysis show that the proposed CI analysis system fulfills these important requirements of such decision support system and can be considered as the first step towards “theoretical development and systematic investigation” (Browne et al. 2017) in the domain of CI information system design.

2 Literature Review

The review of extant literature is conducted in four areas relevant to current research objective, namely, information systems theories, information systems related to CI analysis, text clustering, and automatic text summarization.

2.1 Information Systems Theories

Many IS theories have highlighted the need for concise and accurate content along with a specific form of information to process effectively, due to the inherent capacity limit of human memory (Miller 1956). More specifically, requirement of “quantity, quality, saliency, content, form and credibility” (O’Reilly 1983; Wilson 1981) of information and that of useful chunk size (Miller 1956) to enable better processing by decision makers, can potentially be fulfilled by generation of cluster (chunk)-specific extractive summaries (content/form) from CI corpus by applying appropriate techniques. From information system design perspective, Browne et al. (2017) stated that usability of a system depends on four critical factors, namely “information requirements”, “information needs”, “information demand” and “information use”. These factors are extremely relevant for current research problem as well because clustering and ATS techniques, integrated in a CI system, have the potential to address them. The usability aspect of any information system is also dependent on the absorption of information (Cohen and Levinthal 1990) for creating sustainable competitive advantage (Tallon et al. 2013–2014) for users of the system. This can be facilitated by presenting cluster-specific summaries instead of single monolithic summary. Hence, from perspective of information systems theory, current research on CI analysis system is quite relevant and can be considered as one forward step towards “theoretical development and systematic investigations” (Browne et al. 2017) of “information requirements” of decision-makers (Browne et al. 2017) in the era of big data.

2.2 Information Systems for CI

In general, handling of competitive strategies and competitors is part of strategic information systems (SIS) (Rackoff et al. 1985) or executive information systems (EIS), both of which have remained primarily focused on designing organization-wide IS which leads to competitive advantage. The analysis of competitors is primarily carried out by either evaluating relevant reports or by gathering information through various traditional channels. Current CI information systems, such as CIntell (Donohue and Murphy 2016), focus on this information gathering part on competitors but do not seem to be an efficient IS architecture for decision-making in terms of reducing information overload (Okike and Fernandes 2012; Xu et al. 2011) or addressing the “content” and “form” and “quality” as mentioned in O’Reilly (1983). Creating intelligent CRM system (Zaby and Wilde 2017) which indirectly helps companies to design competitive strategy by gathering CI, has been the subject of some studies. But this focuses more on processes and less on quality of such information. Similarly, Web 2.0 and big data techniques have been used to devise methodology for gathering and building knowledge management system within organizations (Orenga-Roglá and Chalmeta 2017). But clearly a gap exists in applying the latest available technologies to building a system to gather and analyze specifically CI for decision making.

2.3 Text Clustering

Clustering, which is an important technique of unsupervised learning, has found several applications in data mining (Jain et al. 1999; Bissantz and Hagedorn 2009), including customer segmentation in retail (Lockshin et al. 1997), energy (Flath et al. 2012), business process modeling (Wang et al. 2016) etc. Amongst various clustering techniques, K-means (MacQueen 1967; Hornik et al. 2012) has been found to be very effective for high performance applications and has been widely used. More specifically, text clustering based on K-means has been used for various document processing applications as well, including news aggregation and recommendation (Carullo et al. 2009), topic detection and tracking (Allan et al. 2000), grouping of web search queries (Dumais and Chen 2000), sentiment analysis (Ravi and Ravi 2015), opinion mining (Ravi and Ravi 2015), word grouping (Bellegarda et al. 1996) etc. Although K-means clustering has been applied in various domains, there is little evidence of its application to CI in the extant literature. Chakraborti and Dey (2016) have proposed one adaptation of the K-means clustering technique for finding topical groups within a CI corpus, but an evaluation of the quality of these clusters from the CI perspective by managers is missing. Hence, the gap which remains in extant research is to evaluate the quality of such clusters more rigorously from the CI perspective.

2.4 Automatic Text Summarization

Text summarization is also a widely researched topic, including techniques based on natural language analysis (DeJong 1978; Barzilay and Elhadad 1997), semantics (Marcu 1998), discourse (Marcu 1998), ontology (Jishma Mohan et al. 2016; Baralis et al. 2013), graph (Erkan and Radev 2004; Wang et al. 2013), Wikipedia (Sankarasubramaniam et al. 2014) etc. Recent research papers are available on text summarization based on various global optimization techniques, namely, quadratic integer programming (QIP) (Alguliev et al. 2013), integer programming (Alguliev et al. 2011a, b), genetic algorithms (GAs) (Mendoza et al. 2014; Alguliev et al. 2014), differential evolution (DE) (Alguliev et al. 2011a, b, 2012), artificial bee colony (ABC) optimization (Karaboga and Basturk 2007; Chakraborti and Dey 2015) etc. These global optimization-based techniques have generated better results vis-à-vis greedy techniques for abstractive summaries based on standard data sets, e.g., DUC. Although ATS has been used for summarizing from multiple sources, such as patents (Tseng et al. 2007; Codina-Filbà et al. 2017), biomedical text (Reeve et al. 2007), research papers (Lloret et al. 2013), IMF country reports (Ackermann et al. 2006), product reviews (Zhan et al. 2009; Hu et al. 2017), court decisions (Moens 2007) and product news (Chakraborti and Dey 2015), it has not been used specifically for extracting summaries from corpora created with the intention of gathering information on multiple aspects of a business organization’s competitors. The conceptual framework in Chakraborti and Dey (2014) and Chakraborti (2015), which proposes ATS as a component for creating summaries from CI corpora, lacks the support of any empirical analysis that shows the effectiveness of ATS for generating useful system summaries. The other work (Chakraborti and Dey 2015) focuses only on one aspect of CI, i.e., product news summarization. The current research work tries to address these gaps.

3 Research Methodology

It is evident from literature review that ATS is a promising technology for creating concise representations from CI data. Moreover, the CI-specific summaries should preferably be based on homogeneous and relatively small corpus to ease information absorption (Miller 1956; Cohen and Levinthal 1990) and avoid loss of data. Hence, the methodology used in this research has focused on design of an IS system/prototype (“artifact”) using two components, namely clustering and ATS, as the first step, followed by evaluation of its “utility”. Essentially, the approach is aligned with design science research for IS (ISDSR) paradigm (Simon 1996; Hevner et al. 2004; Fischer et al. 2010) which has been applied to multiple areas of IS design to date (Heinrich and Schwabe 2017; Simon 2010; Oberle et al. 2009; Bitzer et al. 2015). The following paragraphs briefly explain how seven guidelines of ISDSR methodology map to various tasks/activities of current research.

3.1 Problem Relevance

As per the ISDSR methodology (Hevner et al. 2004), the “problem relevance” cycle (Hevner 2007) captures system-specific requirements (Kotonya and Sommerville 1998; Stroh et al. 2011) from business as well as technical perspectives. For this research, broad requirements, in terms of content, size, form, use etc., have emerged from the relevant literature review as discussed earlier. These are triangulated by one-to-one focused discussions with a few senior business leaders, who agreed with these requirements and showed eagerness to participate in the evaluation of the system. Overall, the unavailability of any effective CI analysis system to date, combined with the increase in the size of textual data for analysis, also makes the current research highly relevant for CI-specific decision-making.

3.2 Research Rigor

It is evident that current research has theoretical foundations in IS theories as well as in past research in relevant domains, namely ATS, text clustering, and it draws validity and applicability of its components from these. The requirements regarding form, content, quality and usability of such CI information, are drawn from IS theories and from inputs of practicing managers. Unlike previous research in this domain (Chakraborti and Dey 2015), which considers only product news, current research considers multiple aspects of CI, namely, finance, products, research, mergers, social work etc., in its analysis and provides more comprehensive findings. Employing a neutral set of managers for summary evaluation, different from volunteers used for golden extractive summary creation, also ensures that bias is avoided during the analysis.

3.3 Design as a Search Process

As per the ISDSR methodology, designing a robust “artifact” requires comprehensive, if not exhaustive, exploration of the design space/alternatives. In this research, this guideline has been followed in several aspects. The choice of alternatives for the various design components, i.e., clustering and ATS, is based on an extensive review of the state-of-the-art literature. Use of multiple optimization techniques, namely, ABC, DE, GA, and MMR for summary generation, enables comparison of summary qualities and the choice of the best alternative. The requirements of the CI system in terms of quality of information content, form, usability etc. are gathered not only from relevant literature, but from practicing managers as well. Summary generation for each of the global optimization-based techniques is performed by varying the population size to see the effect on summary quality. This research also considers clusters related to multiple CI aspects of a competitor generated from the CI corpus, rather than focusing on a single type of cluster as in Chakraborti and Dey (2015).

3.4 Design as an Artifact

The CI analysis system (“artifact”) is created by integrating two components, namely, ML-KM clustering (Chakraborti and Dey 2016) and the ATS engine based on global optimization. ML-KM clustering, which is designed to handle a CI corpus, ensures that the CI corpus used in the current research is first segregated according to broad topics such as finance, products, research, mergers, social work etc., which are relevant to CI. Then each of these clusters, which are much smaller in size than the original corpus, can be used for generating extractive summaries by the ATS engine. Thus, the final summaries generated by the integrated system are easier to comprehend and can be used more effectively for decision-making. This integrated “artifact” will reduce the complexities of an analytic information system (Arnott and Pervan 2008) and will be more effective from a usability perspective than generating a single summary from a large monolithic heterogeneous CI corpus. It should be noted that the information content of the CI summaries needs to be of high quality, and hence ATS techniques based on global optimization, which have been shown to be effective for other application areas as per the literature, are also chosen for designing this integrated system.

3.5 Design Evaluation

The evaluation of the integrated system focuses on two important aspects, namely, the quality of the extractive summaries which captures nature of information content and the overall usability/effectiveness of these summaries from CI-specific decision-making perspective. The quality of information content of system summaries, generated using global optimization-based techniques, is measured vis-a-vis human-created golden extractive summaries using standard metrics recall/precision (Lin 2004). The practical utility of these CI-specific summaries regarding various items of requirements/usability is collected from practicing managers of business organizations. Both these criteria of evaluation, together, ensure necessary rigor of analysis to find out if the “artifact” addresses the practical issues of a CI information system as much as possible. Secondly, this approach provides a form of triangulation for the research findings as well. Managerial feedback is also obtained on comparative overall effectiveness of ABC- and DE-based global optimization-based ATS techniques in addition to a comparison of their recall/precision scores.

3.6 Research Communication and Research Contribution

The presentation of research findings, their implications including details of experimentations and empirical analysis, is conducted rigorously to convey the novelty and effectiveness of the integrated system. The significant contributions of this research are also presented in detail. Some possible enhancements for future are listed to provide a guideline for exploration of options to improve the integrated system.

4 Data Collection and Preparation

The CI specific corpus is created by conducting targeted search for documents pertaining to a company/competitor from various sources on the Internet. This is very different from the typical benchmarks, namely, DUC (http://www.nist.gov) or Reuters (Reuters 1987) datasets and their corresponding abstractive summaries which are grouped based on certain themes. Therefore, using the DUC or Reuters datasets along with their corresponding abstractive summaries, directly for evaluation of CI-oriented system summaries, is not appropriate. It should also be noted that the primary target of this research is not to benchmark against an existing summarization algorithm using publicly available datasets such as DUC. On the contrary, the goal is to evaluate the integrated system for its usability and viability. Hence, an in-house CI-specific corpus is created for this study, by collecting news, research, financial stories of a specific organization (Samsung) from various online resources (Chakraborti and Dey 2016). In addition to the in-house CI corpus, Reuters (“acq” category), DUC 2001 (five sets) and DUC 2005 (four sets) corpora have also been used to validate the experimental results, some based on relevance to CI and some chosen randomly. A golden extractive summary for each topical cluster was created by human volunteers by selecting 10% (compression ratio) of the sentences. These experts who volunteered for creating golden summaries from topical CI clusters included 44 senior faculty members from across India who participated in a Faculty Development Program (FDP) in 2016 at IIM Indore (India), 6 academic associates and 10 doctoral students from IIM Indore, a total of 60. A subset of 120 clusters (two clusters per participant) was created using a combination of quota and judgment sampling from the original set of 1211 clusters ensuring representation of various cluster types such as finance, products, research, merger and acquisitions, social activities, relevant to CI. 
Next, two clusters per volunteer were assigned randomly after briefing them about summary generation guidelines. Out of 120 expected summaries, only 70 submissions (golden) were received, which consisted of 46 summaries from the Samsung CI corpus, 17 summaries from the Reuters corpus and seven summaries from DUC datasets.

5 Design of Integrated System: Generating Extractive Summaries from Clusters

The extractive summaries are generated by applying global optimization-based summarization techniques, namely ABC, DE, and GA on 70 topical clusters. The greedy-based approach, i.e., MMR technique is also used to generate summaries from the same set of clusters for comparison. As mentioned earlier, the clusters are generated using ML-KM clustering (Chakraborti and Dey 2016), with one modification, namely, use of a randomized Latent Semantic Analysis (LSA) (Halko et al. 2010) instead of a standard LSA (Deerwester 1990) representation, as used in the original paper (Chakraborti and Dey 2016).

5.1 The Summary Scoring Function

The quality of the generated summaries’ information content is crucial for CI analysis by managers. Therefore, the scoring function focuses on this aspect, first by creating a centroid (Radev et al. 2000) of each cluster consisting of “informative” words, and then by measuring the similarity of the candidate summaries with the centroid. For this research, an approach based on basic term frequency (TF) (Luhn 1958) was chosen for identifying “informative” words, even though many more advanced techniques are available, namely, term frequency * inverse document frequency (TF * IDF) score (Luhn 1958), latent semantic analysis (Deerwester 1990), latent Dirichlet allocation (Blei 2003) etc. One reason for this is to evaluate the summaries generated using basic TF-based centroids first and adopt the advanced techniques in future extensions as required. Secondly, using the simple TF-based technique ensures that all keywords, domain-specific acronyms etc. which are common in such a corpus remain part of the centroid, leading to extraction of relevant sentences for CI analysis. The formulation of the key components of the scoring function is explained below.

5.1.1 Measuring Summary Centrality: Formulation of the Centroid

The similarity between the candidate system summary and the centroid of topical cluster is denoted as SC0 and this forms the first component of the summary score function (i.e., the objective function) used in this research:

$${\mathbf{S}}_{{{\mathbf{C0}}}} = {\text{Similarity}}\;{\text{of}}\;{\text{summary}}\;{\mathbf{S}}\;{\text{with}}\;{\text{centroid}}\;{\mathbf{C}}_{{\mathbf{0}}}$$
(1)

The centroid of each topical cluster is created by first ranking the words within the cluster by their TF values and then taking the top 500 words (or fewer) in the ranked list as the central theme, or centroid.
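The centroid construction and the similarity S_C0 can be illustrated with a minimal sketch. The tokenizer, the function names, and above all the use of cosine similarity are assumptions made for illustration; the text specifies only that words are ranked by TF and that the top 500 are kept, not which similarity measure is used.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_centroid(sentences, top_k=500):
    """Rank the cluster's words by raw term frequency and keep the top_k."""
    tf = Counter(tok for s in sentences for tok in tokenize(s))
    return dict(tf.most_common(top_k))


def centroid_similarity(summary_sentences, centroid):
    """Cosine similarity between the summary's TF vector and the centroid.

    Cosine is an assumed choice of similarity measure, used here only
    to make the sketch concrete.
    """
    summ = Counter(tok for s in summary_sentences for tok in tokenize(s))
    dot = sum(count * centroid.get(word, 0) for word, count in summ.items())
    norm_s = math.sqrt(sum(v * v for v in summ.values()))
    norm_c = math.sqrt(sum(v * v for v in centroid.values()))
    if norm_s == 0 or norm_c == 0:
        return 0.0
    return dot / (norm_s * norm_c)
```

Because the centroid keeps raw counts, frequent domain terms and acronyms dominate the similarity, which matches the stated motivation for preferring plain TF.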

5.1.2 Measurement of Redundancy in the Summaries

This research uses the concept of “Total Penalty” (Chakraborti and Dey 2015), defined by Eq. 2, as a measure of the redundancy of a candidate summary taken as a single unit. “Total Penalty” is the sum of the penalties (P) computed for each sentence of the candidate summary in turn. The formula for the total penalty of a candidate summary with n sentences is given below:

$${\text{Total}}\;{\text{penalty}}\;({\mathbf{TP}}) = \sum\limits_{i = 1}^{n} {P_{i} }$$
(2)

5.1.3 Relative Length of Summary as a Measure of Information Content

The relative length (RL) of a summary is defined as:

$${\mathbf{RL}} = \frac{{{\text{Number}}\;{\text{of}}\;{\text{Words}}\;{\text{in}}\;{\text{Candidate}}\;{\text{Summary}}}}{{{\text{Number}}\;{\text{of}}\;{\text{Words}}\;{\text{in}}\;{\text{Topical}}\;{\text{Cluster}}\;{\text{i.e}}.\;{\text{Corpus}}}}$$
(3)

A larger proportion of words in the candidate summary indicates more information content.

5.1.4 Formula for Computing Total Summary Score (TSS)

Combining Eqs. (1)–(3), the total score of a summary is computed as follows:

$${\text{Total}}\;{\text{Score}}\;{\text{of}}\;{\text{Summary}}\;({\mathbf{TSS}}) = {\mathbf{S}}_{{{\mathbf{C0}}}} + {\mathbf{RL}}{-}{\mathbf{TP}}$$
(4)

While the first two terms in Eq. (4) measure the information content, in terms of centrality and length, the third term adds a penalty for redundancy in the candidate summary. This formula (Chakraborti and Dey 2015) is used to score candidate summaries generated by the global optimization techniques, i.e., ABC, GA, and DE. For the MMR-based technique, which is an incremental greedy approach, the following function is used:

$${\text{Total}}\;{\text{Score}}\;{\text{of}}\;{\text{Summary}}\;({\mathbf{TSS}}) = {\mathbf{S}}_{{{\mathbf{C0}}}} + {\mathbf{RL}}$$
(5)
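How the terms of Eqs. (2)–(5) combine can be sketched as follows. The per-sentence penalty P_i is not defined in this section, so the sketch substitutes a hypothetical one (the highest Jaccard word overlap with any earlier sentence of the summary); S_C0 is passed in as a precomputed number. All function names are illustrative.

```python
import re


def words(text):
    """Lowercase alphanumeric tokens (simple tokenizer, an assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relative_length(summary, cluster):
    """Eq. (3): words in the candidate summary / words in the topical cluster."""
    cluster_words = sum(len(words(s)) for s in cluster)
    if cluster_words == 0:
        return 0.0
    return sum(len(words(s)) for s in summary) / cluster_words


def sentence_penalty(sentence, earlier):
    """Hypothetical P_i: highest Jaccard word overlap with any earlier sentence."""
    current = set(words(sentence))
    best = 0.0
    for other in earlier:
        previous = set(words(other))
        union = current | previous
        if union:
            best = max(best, len(current & previous) / len(union))
    return best


def total_penalty(summary):
    """Eq. (2): sum of the per-sentence penalties P_i."""
    return sum(sentence_penalty(s, summary[:i]) for i, s in enumerate(summary))


def total_summary_score(sc0, summary, cluster, with_penalty=True):
    """Eq. (4) when with_penalty is True, Eq. (5) otherwise."""
    score = sc0 + relative_length(summary, cluster)
    if with_penalty:
        score -= total_penalty(summary)
    return score
```

Under this assumed penalty, a summary that repeats a sentence verbatim loses a full point relative to one of the same length with no overlap, whereas Eq. (5) scores the two identically.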

5.2 Description of the Optimization Problem for Summary Generation

Based on above discussion, the simple single-objective optimization problem formulation of automatic text summarization for CI clusters can be written as:

[Figure a: optimization problem formulation]

The compression ratio (configurable) indicates the percentage of sentences selected from a topical cluster to be included in the corresponding extractive summary.
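One plausible way to write this single-objective problem, pieced together from Eq. (4), the compression ratio described above, and the duplicate-free encoding used later, is given below. The symbols r (compression ratio), M (number of lines in the topical cluster), and s_i (the i-th selected line id) are notation introduced here for illustration, and the rounding of r · M to a whole number of sentences is an assumption.

$$\max_{S}\;{\mathbf{TSS}}(S) = {\mathbf{S}}_{{\mathbf{C0}}} + {\mathbf{RL}} - {\mathbf{TP}}\quad {\text{subject}}\;{\text{to}}\quad |S| = N,\;\;N \approx r \cdot M,\;\;s_{i} \in \{ 0, \ldots ,M - 1\} ,\;\;s_{i} \ne s_{j} \;(i \ne j)$$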

5.3 Solution to the Optimization Problem Using Stochastic Algorithms

The basic intention here is to generate the best quality candidate summary as the solution based upon the optimization function (Eq. 4).

5.3.1 Encoding of Candidate Summaries

Each candidate summary, i.e., each potential solution, is described by an integer vector of length N, where N is the number of sentences in the summary. N is determined by the total number of lines (grammatically complete English text lines) in the topical cluster and the compression ratio.

The range of values at each index in the summary solution vector is [0:MAX_LINE_NUMBER − 1], where MAX_LINE_NUMBER is the total number of lines in the cluster and each line is assigned a unique id based on its position in the text. Figure 1 shows one such solution vector. Note that the final generated summary uses this solution vector to select the corresponding sentences from the cluster and presents them to the user in their original text sequence.

Fig. 1 Integer vector of a candidate summary
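The encoding can be made concrete with a short sketch. The function names are illustrative, and the rounding rule used to obtain N from the compression ratio (round down, with a minimum of one sentence) is an assumption, since the text does not state it.

```python
import random


def summary_length(num_lines, compression_ratio=0.10):
    """Number of sentences N in a summary.

    Rounding down with a floor of one sentence is an assumed rule.
    """
    return max(1, int(num_lines * compression_ratio))


def random_solution(num_lines, n, rng):
    """Draw n distinct line ids from [0, num_lines - 1]."""
    return rng.sample(range(num_lines), n)


def decode(solution, lines):
    """Return the selected lines in their original text sequence."""
    return [lines[i] for i in sorted(solution)]


if __name__ == "__main__":
    lines = [f"Sentence {i}." for i in range(30)]
    rng = random.Random(42)  # fixed seed so the run is repeatable
    solution = random_solution(len(lines), summary_length(len(lines)), rng)
    print(solution, decode(solution, lines))
```

Sorting the ids only at decoding time keeps the vector itself free-form for the optimizer while the reader still sees sentences in document order.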

5.3.2 Generation of System Summaries using the Optimization Techniques

The parameter configuration of the four techniques, namely, ABC (Karaboga and Akay 2011), DE (Storn and Price 1996), GA (Holland 1975), and MMR (Carbonell and Goldstein 1998), used for summary generation from clusters is shown in Table 1. As mentioned before, the use of four techniques ensures robust exploration of design alternatives.

Table 1 Parameters of optimization techniques

The compression ratio is taken as 10% for each of these cases. Increasing the population size/generations moderately did not result in significant improvement in the quality of extractive summaries. As candidate summary line number generation depends on randomization, duplicates can be generated in the solution vectors, i.e., population. For the current research, necessary algorithmic modifications and adaptations have been introduced to prevent duplicate sentence number generation in solution vectors during initialization. As a result, the selection criteria (Deb’s method) (Deb 2000) used in the original ABC method (Karaboga and Akay 2011) can be bypassed, as the duplicate removal step always ensures the generation of a feasible solution. The duplication avoidance technique is applied to all three global optimization-based techniques. The ABC-based summarization required an adjustment of the fitness function: it directly uses TSS (Eq. 4) rather than its inverse, to effect maximization rather than the minimization of the original algorithm (Karaboga and Akay 2011).
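The duplicate-avoidance idea can be sketched generically as below. This is not the ABC, DE, or GA update rule itself: `neighbour` is a hypothetical single-position perturbation, `repair_duplicates` is one assumed way of restoring uniqueness, and `greedy_accept` shows only the maximizing comparison on a score such as TSS.

```python
import random


def repair_duplicates(solution, num_lines, rng):
    """Replace repeated line ids with unused ones so every id is unique.

    Assumes len(solution) <= num_lines, so enough unused ids exist.
    """
    present = set(solution)
    unused = [i for i in range(num_lines) if i not in present]
    rng.shuffle(unused)
    seen = set()
    repaired = []
    for line_id in solution:
        if line_id in seen:
            line_id = unused.pop()  # swap the repeat for a fresh id
        seen.add(line_id)
        repaired.append(line_id)
    return repaired


def neighbour(solution, num_lines, rng):
    """Perturb one random position, then repair any duplicate it created."""
    candidate = list(solution)
    candidate[rng.randrange(len(candidate))] = rng.randrange(num_lines)
    return repair_duplicates(candidate, num_lines, rng)


def greedy_accept(current, candidate, score):
    """Keep whichever vector has the higher score (maximization)."""
    return candidate if score(candidate) > score(current) else current
```

Since every vector leaving `repair_duplicates` has distinct ids in range, each one is feasible, which is why no separate feasibility-based selection rule is needed in this sketch.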

6 Design Evaluation: Experimental Results

The quality of the system summaries generated by the four techniques, namely ABC, DE, GA, and MMR, is presented below along with relevant interpretation. The survey results obtained from practicing managers regarding the quality of system-generated summaries from the CI perspective, and their mapping to IS theory, are also explained.

6.1 Measuring the Quality of the System Summaries

The quality of the 70 system summaries generated by the global optimization techniques, namely ABC, DE, and GA, is measured against the golden summaries created by human volunteers, using recall and precision scores based on Longest Common Subsequence (LCS) matching, as available in the Recall-Oriented Understudy of Gisting Evaluation (ROUGE) (Lin 2004) tool. The average values of the ROUGE-L (ROUGE LCS) recall and precision scores for summaries generated by each of these global optimization techniques are given in Table 2. The statistical significance of these values is also verified, because recall and precision scores greater than 0.40 are considered very good for standard datasets (Alguliev et al. 2011a, b, 2012). A one-sample t test was performed using the IBM SPSS tool with the following set of hypotheses:

  • H0: µ = 0.42; H1: µ > 0.42 (recall)

  • H0: µ = 0.40; H1: µ > 0.40 (precision)

The respective values used in the null and alternative hypotheses were found incrementally, starting from 0.40 as the benchmark noted in Alguliev et al. (2011a, b, 2012), until results improved (up to two significant digits). Table 2 shows that the average recall scores are significantly greater than 0.42, and hence exceed the 0.40 benchmark, whereas the average precision scores are taken to be equal to 0.40 because the null hypothesis could not be rejected in this case.
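The test statistic behind these hypotheses can be sketched with the standard library alone. The function name is illustrative and the scores in the usage example are made up; they are not the values behind Table 2. Converting t to a one-tailed p value needs the t distribution, which is left to a statistics package (the study itself used IBM SPSS).

```python
import math
import statistics


def one_sample_t(scores, mu0):
    """Return (t statistic, degrees of freedom) for H0: mean == mu0."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1 divisor)
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Made-up recall scores, for illustration only.
    recall = [0.44, 0.46, 0.43, 0.47, 0.45]
    t, df = one_sample_t(recall, mu0=0.42)
    print(f"t = {t:.3f}, df = {df}")
```

For H1: µ > µ0, a large positive t relative to the critical value at the chosen significance level leads to rejecting H0.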

Table 2 Average ROUGE-L recall and precision scores

For research paper summary generation, the average recall and precision scores were found to be 0.30 and 0.20 respectively (Lloret et al. 2013). Hence, in comparison, the global optimization-based extractive summarization techniques perform better in the context of CI information analysis.
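For reference, the LCS-based recall and precision reported in this section can be sketched as follows. This is a simplified, sentence-level variant that treats each summary as one whitespace-tokenized sequence; the ROUGE tool applies its own preprocessing and a summary-level LCS, so its scores may differ from this sketch.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(reference, candidate):
    """Return (recall, precision) based on LCS over lowercase tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(cand) if cand else 0.0
    return recall, precision
```

Recall divides the LCS length by the length of the golden summary and precision by the length of the system summary, so a system summary longer than the golden one tends to score lower on precision than on recall.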

The other important finding is that all three global optimization (ABC-, DE-, GA-) based techniques of summarization perform, on average, better than MMR-based summarization (greedy approach) in terms of recall. This is validated statistically by running paired sample t-tests for ABC, DE, and GA recall scores against the MMR-based recall scores for 70 topical summaries with necessary p value adjustment (Bland and Altman 1995). But in terms of precision, all these techniques perform worse than MMR-based technique (again validated by pairwise comparison of precision scores with necessary p value adjustment) because current optimization function TSS (Eq. 4) does not check for mutual sentence-specific overlap. This can be taken up for future revisions of this research.
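The paired comparison can be sketched as below, assuming that the p value adjustment is a Bonferroni-style correction (each p value multiplied by the number of comparisons). The function names are illustrative, and the p values themselves would come from a statistics package.

```python
import math
import statistics


def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


def bonferroni(p_values):
    """Multiply each p value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

With three comparisons against MMR (ABC, DE, and GA), an unadjusted p value of 0.01 becomes 0.03 under this assumed correction.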

Pairwise comparison (paired sample t-tests with necessary p value adjustment) of the recall (and precision) scores of ABC-, DE- and GA-based summaries reveals that, statistically, all three are comparable. Hence any one of these three global optimization techniques, namely, ABC, DE, and GA, can be used for the analysis of CI corpora, unless other significant observations emerge from this data. This observation is reinforced by the fact that the recall (and precision) scores of these three global optimization-based techniques are all strongly positively correlated (Pearson correlation coefficient > 0.90), and the correlations are statistically significant.

6.2 Evaluation of Summaries by Managers: Linking Back to Information Processing Theory

To judge the value of the automatically generated summaries, a survey was conducted among senior decision makers of several business organizations, who were requested to evaluate samples of these summaries from different CI perspectives, such as the quality of the generated summaries and their usefulness. The questionnaire consisted of items related to various decision-making criteria and requirements, such as “information needs”, “information use”, “content”, “form”, and “quality”, as described in Browne et al. (2017), O’Reilly (1983) and Wilson (1981). Responses were recorded on a Likert scale (1–5) and tested for statistical significance. The average scores of these parameters are shown in Table 3.

Table 3 Survey score of important parameters

The figures reveal that, in terms of the content/form of the generated summaries, the average score is approximately 3.86/5, which is high and statistically significant. The score for the “information needs” criterion specific to CI information is also encouraging (approximately 3.73/5) and statistically significant. For the “information use” criterion, which measures whether decision makers will use these summaries, the average score is approximately 4.11/5. Overall, the summaries generated by the integrated system are therefore found to address the requirements of information system design from a CI perspective.

The survey also obtained preference scores from the managers for two types of summaries, namely those generated by DE-based and ABC-based optimization, with respect to “information needs”. On average, managers gave a rating of 3.28/5 to the DE-based summaries and 3.68/5 to the ABC-based summaries, and both ratings are statistically significant. This suggests that managers generally found the summaries generated by global optimization techniques useful from a CI perspective, and that the ABC-based summaries were judged to convey better information on CI.

The second set of results concerns the overall scoring of the ABC- and DE-based summaries. On average, the managers gave a rating of 5.65/10 to the DE summary and 6.21/10 to the ABC summary. Managers thus rated the ABC summaries higher than the DE summaries, and these results are also statistically significant.

The above responses from managers on the important parameters show empirically that the extractive summaries generated by the system are useful from a CI perspective, and thus represent a step forward towards “theoretical development and systematic investigations of these foundations” (Browne et al. 2017).

7 Conclusion and Research Implications

This study presents the design and evaluation of an integrated system for CI analysis for business decision makers, using text clustering followed by ATS. More specifically, it has explored three important global optimization techniques, namely ABC, DE, and GA, to generate extractive summaries from topical clusters (created from the CI corpus in the ML-KM clustering phase) and has subsequently evaluated the quality of these summaries using recall and precision. Overall, the global optimization techniques (ABC, DE, GA) are found to generate better-quality extractive summaries than the greedy MMR approach in terms of recall, while all three performed comparably against each other. This supports the choice of any one of the global optimization-based techniques for generating extractive summaries in this CI analysis system. Secondly, the findings on summary quality based on standard metrics (recall and precision) are triangulated by means of a summary evaluation conducted by practicing managers. This step shows how extractive summaries address the requirements of “information need” (Browne et al. 2017), “information use” (Browne et al. 2017), and “content/form/quality” (O’Reilly 1983) for decision makers, and make the task of CI analysis easier in the era of big data. It also validates the effectiveness of extractive summaries generated by global optimization techniques as a form of capturing CI information. This kind of analysis, which can be extended to include an appropriate dashboard for topical cluster and summary visualization with link back to source (LBS), will improve the CI analysis platform and the associated business processes in strategic decision making. Some areas for future research are: use of other CI corpora for validation of the integrated system; use of techniques such as named entity recognition, part-of-speech tagging, and topic identification using LDA to improve the quality of the topical clusters and system-generated summaries; use of abstractive summaries as CI representation; evaluation of the performance and memory usage of the integrated system under varying optimization parameters; and study of usability requirements at a deeper level to gain the trust of decision makers in the information content, so that the system eventually becomes part of the real decision-making process.