1 Introduction

Users are increasingly participating in online question-and-answer (Q&A) communities such as Yahoo! Answers, Reddit, and forums hosted on Stack Exchange to seek answers to their questions and/or provide solutions to others' problems [1, 2]. In 2021 alone, the Stack Exchange network had 3.2 million questions posted; that amounts to an average of 365 questions asked on the platform every single hour. The efficiency and effectiveness of problem-solving in Q&A communities, however, depend on how quickly submitted questions are made noticeable to experts with relevant knowledge, as well as on how potential answer providers perceive the usefulness of the questions. Accordingly, large online Q&A platforms such as Reddit and Stack Exchange have adopted user voting to filter/rank submitted questions. Users can voluntarily and anonymously vote submitted questions up or down, and questions with the highest votes are displayed at the top of the question list or recommended to potential problem solvers with the highest priority.

However, user voting on questions is inefficient, especially in large online communities, since assessing the various quality aspects of submitted content requires a significant amount of cognitive effort. Furthermore, the voluntary nature of user voting in most online communities may lead to a systemic problem due to errors of omission [3]. Studies have shown that the percentage of users participating in content voting is relatively low across various online settings [3, 4], and user voting may be seriously biased under certain conditions [5]. Thus, to facilitate effective and efficient knowledge exchange, an imperative task for Q&A communities is to automatically predict the usefulness of questions using machine learning methods.

Machine learning learns patterns from data without explicit programming. There are two broad approaches: classical machine learning and the more recently developed deep learning methods. Although deep learning methods have shown promise in various applications, especially when large amounts of training data are available, classical machine learning methods remain widely applied in numerous scenarios. In the context of online Q&A communities, questions remain as to: (1) how classical machine learning and deep learning methods can be implemented to assess the usefulness of questions, (2) what design principles can guide the implementation of machine learning methods, and (3) under what conditions deep learning methods perform better than classical methods.

To provide guidelines for research and practice, this research investigates the application of a set of classical machine learning and deep learning methods for predicting the usefulness rating of questions in online Q&A communities. A large dataset collected from a Q&A platform was used to train these models and compare their predictive performance. The results provide important implications for both the research and practice of online Q&A communities.

This paper is organized as follows. The next section reviews work related to the prediction of question usefulness, machine learning, deep learning, and word embedding methods. The research method is then explained in Sect. 3, and Sect. 4 presents preliminary results. Section 5 discusses the current work and future directions for improving the performance of the predictive models.

2 Related Work

2.1 Usefulness of Questions

Rating the usefulness of user-generated content is a common mechanism on online platforms. For example, consumers can rate the usefulness of customer reviews posted by others [6, 7]. In Q&A communities, not all questions posted have an equal opportunity of being solved: questions that are perceived as useful tend to receive more attention from potential experts who have sufficient knowledge and experience to solve the problems. Thus, how a question is composed can often determine whether, and how quickly, it will be solved. This can be understood from the perspective of signaling theory, which suggests that people assess the quality of content through a variety of cues or signals that help reduce the information asymmetry between the information signaler and the recipient [8]. Knowledge seekers therefore purposively include important information in their questions so that the questions attract attention and interest from other peers in the community. Guided by this theoretical framework, the present research proposes that a set of important cues can signal the usefulness of questions.

Specifically, there is an abundance of basic linguistic cues that can be used to convey purposive information from one party to another. As presented in Table 1, cues such as informativeness, diversity, media richness, readability, spelling, and sentiment can be used to explain or predict the usefulness of questions in Q&A communities (a minimal extraction sketch follows Table 1). In addition, Linguistic Inquiry and Word Count (LIWC) features can also be used to predict question usefulness; the validity and reliability of LIWC features have been verified by previous studies [9,10,11].

Table 1. Description of basic linguistic features.
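To make these cues concrete, the following is a minimal Python sketch of how basic linguistic features of this kind might be computed. The definitions below (word count for informativeness, type-token ratio for diversity, embedded links/images for media richness, and the Flesch reading-ease formula for readability) are illustrative assumptions; the exact operationalizations behind Table 1 may differ.

import re

def basic_linguistic_features(text: str) -> dict:
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words) or 1
    n_sents = len(sentences) or 1
    # crude syllable estimate: runs of vowels per word
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w))) for w in words)
    return {
        "informativeness": len(words),                   # question length in words
        "diversity": len(set(words)) / n_words,          # type-token ratio
        "media_richness": text.count("<img") + text.count("<a href"),
        "readability": 206.835 - 1.015 * (n_words / n_sents)
                       - 84.6 * (syllables / n_words),   # Flesch reading ease
    }

print(basic_linguistic_features("How should I label a toggle switch? See <a href='...'>this example</a>."))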

2.2 Machine Learning and Feature Engineering

Machine learning methods can automatically learn structural patterns from data. In application scenarios where analytical solutions are not possible but a dataset is accessible, machine learning methods are often preferred for constructing empirical solutions, such as spam filtering, credit scoring, product recommendation, and image recognition. The well-known no-free-lunch (NFL) theorem proposed by Wolpert [19] suggests that no single machine learning algorithm performs best across all learning tasks. In other words, a comparison of machine learning methods (both classical and deep learning approaches) is needed for any specific domain task. A typical machine learning process includes data processing, feature extraction, feature selection, model training, model evaluation, and implementation.

A key factor in the success of machine learning projects is feature engineering, which generates and prepares a set of important features from the raw data [20]. The feature engineering process also marks the key difference between classical machine learning methods (such as linear regression, decision trees, support vector machines, random forests, and AdaBoost) and the recently developed deep learning methods. Classical machine learning methods rely on a manual feature engineering process in which experts extract a set of important features from the raw data, while deep learning methods can automatically extract multiple levels of features from raw data [21].

2.3 Deep Learning

The recent advances in deep learning have motivated researchers and practitioners to apply deep neural networks to predict outcomes in numerous applications. Compared to classical machine learning methods, deep learning methods are more computationally expensive. Interestingly, deep learning methods tend to perform well even when models overfit the data [22], a phenomenon generally called benign overfitting [23]. With recent advances in algorithms and hardware, deep learning has emerged as an attractive approach for various applications, including the classification or prediction of user-generated content on social media [24, 25]. Specifically, the convolutional neural network (CNN) and the recurrent neural network (RNN), the two major types of deep learning architectures, have been used for various natural language processing and text mining tasks [26]. CNN was originally developed for image recognition and uses convolution layers to automatically extract important features. RNN processes sequential data by using a loop structure that feeds early state information back into the current state. Long short-term memory (LSTM) is a specific RNN model originally developed to learn long-term dependencies in data [27].
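As an illustration of the text-CNN idea described above, below is a minimal binary-classification sketch in Keras. The study's actual architecture is not detailed here, so the vocabulary size, embedding dimension, and filter settings are illustrative assumptions, not the paper's values.

import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    layers.Embedding(input_dim=20000, output_dim=100),  # learned word embeddings
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),  # n-gram feature detectors
    layers.GlobalMaxPooling1D(),            # keep the strongest response per filter
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # P(question is useful)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])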

2.4 Word Embedding

Machine learning methods applied to text mining usually require an embedding method that maps the raw data (characters, words, documents, etc.) to vectors that can then be fed into the machine learning models. The word2vec model [28] and the doc2vec model [29] are two popular word embedding methods for text mining tasks such as sentiment analysis [30], online content quality assessment [31], and news classification [32]. Both embedding methods can be used as an alternative to traditional bag-of-words (BOW) approaches such as TF-IDF (term frequency-inverse document frequency) matrices.

Since the word2vec method only produces vector representations for individual words, those representations cannot be directly used for predictive analytics at the document level. In practice, word2vec representations need to be aggregated to the document level for document classification. As an extension of the word2vec model, the doc2vec method directly learns continuous representations of documents. Doc2vec is particularly attractive for text mining tasks given its ability to capture semantic meaning from textual data. Thus, this research applies the doc2vec embedding method. Specifically, two variants of doc2vec, the distributed memory (DM) and distributed bag-of-words (DBOW) models, are used to extract vector representations of online questions.
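A minimal sketch of training both doc2vec variants with the Gensim package follows, assuming questions is a list of tokenized question texts; the training hyper-parameters are illustrative, although the 200-dimensional vector size matches the per-variant feature counts reported in Sect. 3.3.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# `questions` is assumed to be a list of tokenized question texts
docs = [TaggedDocument(words=tokens, tags=[i]) for i, tokens in enumerate(questions)]

dm = Doc2Vec(docs, dm=1, vector_size=200, window=5, min_count=2, epochs=20)    # distributed memory
dbow = Doc2Vec(docs, dm=0, vector_size=200, window=5, min_count=2, epochs=20)  # distributed bag-of-words

# side-by-side DM and DBOW vectors give a 400-dimensional document representation
doc_features = [list(dm.dv[i]) + list(dbow.dv[i]) for i in range(len(docs))]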

3 Research Method

An experiment was conducted to implement various classical machine learning and deep learning approaches to predict the usefulness of questions, and the resulting predictive models were compared. The following subsections explain the details of the research method used in this study.

3.1 Data

The dataset was collected from a community-based open Q&A website for user experience designers and professionals. In the community, users can ask questions related to the design of user interfaces and answer questions posted by other peers. After a user submits a question, other users can vote the question's usefulness up or down. Questions with the highest net votes (i.e., up votes minus down votes) are displayed at the top of the question list so that all community users view them first when browsing the list. Figure 1 shows a sample question with usefulness votes.

Fig. 1. A sample question with usefulness votes.

The dataset contains 30,718 questions posted from January 2010 to November 2021. The whole dataset was split into a training set of 24,574 questions (80%) and a test set of 6,144 questions (20%). The training set was used to train the machine learning models, and the test set was used to evaluate their performance.
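A minimal sketch of this 80/20 split using scikit-learn follows, assuming X holds the extracted features and y the usefulness labels; the fixed random seed and stratification are assumptions for reproducibility, not details reported here.

from sklearn.model_selection import train_test_split

# X: extracted feature array, y: usefulness labels (both assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))  # 24,574 and 6,144 for the 30,718 questions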

3.2 Predictive Modeling

Given that a question posted to the community can be voted up or down, the usefulness of a question is dichotomized as a binary variable:

$$ Usefulness = \begin{cases} 1, & \text{if } up\ votes - down\ votes \ge 1 \\ 0, & \text{if } up\ votes - down\ votes \le 0 \end{cases} $$
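This dichotomization rule translates directly into code. A minimal sketch follows, assuming a pandas DataFrame with up_votes and down_votes columns (the column names are assumptions):

import pandas as pd

df = pd.DataFrame({"up_votes": [5, 1, 0], "down_votes": [1, 3, 0]})
df["usefulness"] = (df["up_votes"] - df["down_votes"] >= 1).astype(int)
print(df["usefulness"].tolist())  # [1, 0, 0]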

Figure 2 presents the overall predictive modeling procedure. After the dataset was collected from the online Q&A community, important features were extracted from the raw data. Specifically, the feature set includes the basic linguistic features (explained in Table 1), LIWC features calculated using the LIWC software tool [10], a TF-IDF matrix as BOW features, and doc2vec features (from both the DM and DBOW models) trained using the Gensim package [33]. In total, 1,216 features were extracted. Then, classical machine learning methods including logistic regression, support vector machines, decision trees, and random forests were applied to classify usefulness based on the extracted features. In addition, a CNN deep learning model was applied directly to the textual data to classify the usefulness of questions. Finally, all predictive models were compared in terms of their predictive performance; a minimal sketch of the classical-model stage follows Fig. 2.

Fig. 2. Predictive modeling procedure.
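The sketch below covers the classical-model stage of this procedure using TF-IDF features alone for brevity; the full study combines them with linguistic, LIWC, and doc2vec features (1,216 in total). The model hyper-parameters are illustrative, and LinearSVC stands in for the support vector machine.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# train_texts/test_texts and y_train/y_test are assumed to exist
tfidf = TfidfVectorizer(max_features=500, stop_words="english")
X_train_vec = tfidf.fit_transform(train_texts)  # fit on training data only
X_test_vec = tfidf.transform(test_texts)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm": LinearSVC(),
    "decision_tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=300),
}
for name, clf in models.items():
    clf.fit(X_train_vec, y_train)
    print(name, clf.score(X_test_vec, y_test))  # test-set accuracy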

3.3 Feature Selection

The importance of all features was evaluated by applying a random forest algorithm. Figure 3 presents the importance scores of all 1,216 features.

Fig. 3. Feature importance.

To reduce the dimensionality of the predictive models, only the 600 most important features were selected for classical machine learning modeling. Table 2 presents a summary of these features with their average importance scores. Among the 600 selected features, 400 were derived from the doc2vec models (200 from the doc2vec DBOW model and 200 from the doc2vec DM model), which clearly demonstrates the capability of doc2vec models in deriving important features.

Table 2. Summary of top 600 most important features.
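A minimal sketch of this importance-ranking and selection step follows, assuming X_train is the (samples × 1,216) feature matrix with a parallel list feature_names; the random forest hyper-parameters are illustrative assumptions.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

order = np.argsort(rf.feature_importances_)[::-1]  # most important first
top_idx = order[:600]
X_train_sel = X_train[:, top_idx]                  # reduced design matrix
top_names = [feature_names[i] for i in top_idx]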

4 Preliminary Results

Table 3 summarizes the preliminary comparison of the classical and deep learning models. Among all models compared, the random forest has the highest accuracy (0.6918), F1 score (0.8139), and recall (0.9544), whereas logistic regression has the highest AUC (area under the ROC curve, 0.6286). The CNN model, which learns word embeddings directly from the textual data, achieves intermediate performance. This result underscores the need for theoretical guidance in classical machine learning modeling: with a strong theoretical basis (such as signaling theory in this study) guiding feature engineering, classical machine learning methods can outperform deep learning methods. The result also shows the promise of deep learning methods in automatically extracting important features for textual content classification. In application settings where strong theoretical guidance is not available, deep learning approaches can still achieve good performance thanks to their capability of automatically extracting important features.

Table 3. Comparison of predictive models.
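For reference, the metrics reported in Table 3 can be computed with scikit-learn as sketched below, where model is any fitted classifier exposing predict_proba (such as the random forest above):

from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score

y_pred = model.predict(X_test_sel)
y_prob = model.predict_proba(X_test_sel)[:, 1]  # P(useful), needed for AUC

print("accuracy:", accuracy_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))
print("recall:", recall_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))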

5 Discussion

Online Q&A communities have offered an excellent opportunity for people to solve their problems without temporal and spatial constraints. To effectively seek answers, questions need to be composed in a way that can reduce the information asymmetry between knowledge seekers and potential knowledge providers. Informed by signaling theory, this research suggests that a variety of linguistic features can be used to predict the usefulness of questions submitted to Q&A communities. Specifically, this research has explored various classical machine learning and deep learning methods for predicting question usefulness.

As demonstrated by the preliminary results in Sect. 4, this study has evaluated a set of classical machine learning methods for classifying the usefulness of questions. However, only a specific CNN model was evaluated. In future work, more deep neural network architectures (such as a simple RNN and an LSTM) will be thoroughly evaluated. Features manually extracted from the textual content can also be fed into deep learning structures to test how deep learning methods perform with those manual features. An ensemble of classical machine learning and deep learning methods can also be evaluated. Importantly, a grid search strategy will be used to tune the numerous hyper-parameters of deep learning models.
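A minimal sketch of how such a grid search might be set up as a plain loop is shown below; build_text_cnn, the training/validation arrays, and the grid values are all illustrative assumptions rather than this study's actual configuration.

from itertools import product

grid = {
    "filters": [64, 128],
    "kernel_size": [3, 5],
    "learning_rate": [1e-3, 1e-4],
}
best_config, best_acc = None, -1.0
for filters, kernel_size, lr in product(*grid.values()):
    model = build_text_cnn(filters=filters, kernel_size=kernel_size,
                           learning_rate=lr)  # build_text_cnn is hypothetical
    model.fit(X_train_seq, y_train, epochs=5, verbose=0)
    _, acc = model.evaluate(X_val_seq, y_val, verbose=0)
    if acc > best_acc:
        best_config, best_acc = (filters, kernel_size, lr), acc
print("best configuration:", best_config, "val accuracy:", best_acc)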

Future work can also model the prediction of question usefulness as a regression problem, applying a variety of regression models to predict the raw count of usefulness votes. The findings of this research will provide practical and theoretical implications for improving the effectiveness and efficiency of knowledge exchange in online Q&A communities. Machine learning algorithms provide a technical approach to automatically filter/rank questions submitted to online Q&A communities without relying on usefulness voting by users. This opens rich opportunities for designing new community features or mechanisms that address the grand challenge of supporting effective online knowledge exchange.