Abstract
The need for multilingual information retrieval is rising steadily. Specialized text-based Web forums are a valuable source of relevant information. However, extracting informative messages is often hindered by the large number of non-informative posts (so-called off-topic posts) and by the informal language commonly used on forums.
This paper addresses the automatic identification of posts that are potentially useful for sharing professional experience within text forums, irrespective of the forum's language. For our experiments we selected subsets from several text forums in different languages. Manual annotation was performed by native-speaking experts. Textual, thread-based, and social graph features were extracted. To select satisfactory language-independent forum features, we used gradient boosting models, the relative influence metric for model analysis, and the NDCG metric for measuring the quality of the selection method.
We have formed a satisfactory set of forum features that indicate a post's utility, do not demand sophisticated linguistic analysis, and are suitable for practical use.
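As an illustration of the NDCG metric named above, the following is a minimal sketch and not the authors' implementation. It assumes the common linear-gain, log2-discount formulation; the abstract does not state which variant was used, and the function names `dcg` and `ndcg` are hypothetical.

```python
import math


def dcg(relevances, k=None):
    """Discounted cumulative gain of a ranked list of relevance grades.

    Assumption: linear gain with a log2 position discount.
    """
    if k is not None:
        relevances = relevances[:k]
    # enumerate() is 0-based, so the item at 1-based rank r gets discount log2(r + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal, k)
    if ideal_dcg == 0:
        # No relevant items at all: define the score as 0 to avoid division by zero.
        return 0.0
    return dcg(relevances, k) / ideal_dcg
```

Here `relevances` would hold the expert-assigned usefulness grades of posts, listed in the order a model ranks them; a score of 1.0 means the model's ordering matches the ideal ordering.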
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Grozin, V.A., Gusarova, N.F., Dobrenko, N.V. (2015). Feature Selection for Language Independent Text Forum Summarization. In: Klinov, P., Mouromtsev, D. (eds) Knowledge Engineering and Semantic Web. KESW 2015. Communications in Computer and Information Science, vol 518. Springer, Cham. https://doi.org/10.1007/978-3-319-24543-0_5
DOI: https://doi.org/10.1007/978-3-319-24543-0_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24542-3
Online ISBN: 978-3-319-24543-0
eBook Packages: Computer Science, Computer Science (R0)