Abstract
In text summarization, relevance and coverage are the two main criteria that determine the quality of a summary. In this paper, we propose SumCR, a new multi-document summarization approach based on sentence extraction. A novel feature called Exemplar is introduced to address both concerns simultaneously during sentence ranking. Unlike conventional approaches, in which the relevance value of each sentence is calculated over the whole collection of sentences, the Exemplar value of each sentence in SumCR is obtained within a subset of similar sentences. A fuzzy medoid-based clustering approach is used to produce sentence clusters, or subsets, each of which corresponds to a subtopic of the related topic. This subtopic-based feature captures the relevance of each sentence within different subtopics and thus improves the chance that SumCR produces a summary with wider coverage and less redundancy. Another feature incorporated in SumCR is Position, i.e., the position at which each sentence appears in its document. The final score of each sentence is a combination of the subtopic-level feature Exemplar and the document-level feature Position. Experimental studies on DUC benchmark data demonstrate that SumCR performs well and show its potential for summarization tasks.
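The scoring idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the definitions of Exemplar or Position, so everything below is an assumption made for illustration only: the fuzzy memberships are taken as given (no clustering is performed), the exemplar-like score is a membership-weighted average similarity within each subtopic, the position score is the reciprocal of the 1-based sentence position, and the two are mixed linearly with a hypothetical weight `alpha`. None of these choices should be read as the formulas used in SumCR.

```python
from typing import List

Matrix = List[List[float]]


def exemplar_scores(sim: Matrix, membership: Matrix) -> List[float]:
    """Assumed stand-in for a subtopic-level score (not the paper's formula).

    sim[i][j]        -- similarity between sentences i and j
    membership[i][c] -- fuzzy membership of sentence i in subtopic c
    For each subtopic, a sentence's centrality is its membership-weighted
    average similarity to the other sentences; the score is the best
    membership-scaled centrality over all subtopics.
    """
    n, k = len(sim), len(membership[0])
    scores = []
    for i in range(n):
        best = 0.0
        for c in range(k):
            others = [j for j in range(n) if j != i]
            weight_total = sum(membership[j][c] for j in others)
            if weight_total == 0:
                continue  # no other sentence belongs to this subtopic
            centrality = sum(membership[j][c] * sim[i][j] for j in others) / weight_total
            best = max(best, membership[i][c] * centrality)
        scores.append(best)
    return scores


def position_scores(positions: List[int]) -> List[float]:
    """Assumed document-level score: earlier sentences (1-based) score higher."""
    return [1.0 / p for p in positions]


def rank_sentences(sim: Matrix, membership: Matrix, positions: List[int],
                   alpha: float = 0.7) -> List[int]:
    """Rank sentence indices by an assumed linear mix of the two scores."""
    ex = exemplar_scores(sim, membership)
    pos = position_scores(positions)
    final = [alpha * e + (1.0 - alpha) * p for e, p in zip(ex, pos)]
    return sorted(range(len(final)), key=lambda i: final[i], reverse=True)


# Toy data: sentences 0 and 1 share one subtopic, sentences 2 and 3 another.
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
membership = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
positions = [1, 2, 1, 3]

print(rank_sentences(sim, membership, positions))  # [0, 2, 1, 3]
```

In this toy case the two top-ranked sentences come from different subtopics, which is the kind of coverage effect the abstract attributes to scoring sentences within subtopics rather than over the whole collection.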
Cite this article
Mei, JP., Chen, L. SumCR: A new subtopic-based extractive approach for text summarization. Knowl Inf Syst 31, 527–545 (2012). https://doi.org/10.1007/s10115-011-0437-x
DOI: https://doi.org/10.1007/s10115-011-0437-x