
1 Introduction

The literature provides evidence that furnishing decision-makers with trustworthy information has a positive impact on both tactical and strategic decisions. The growing need to discover and integrate reliable information from heterogeneous data sources distributed across the Web, social networks, Cloud platforms, and Data Lakes makes Data Quality (DQ) an imperative topic. Sentiment analysis, which has become one of the most important elements in the decision-making process, is concerned with gathering, analyzing, specifying, and predicting user opinions, most of which are expressed in natural language. According to [1], the quality of social media data streams is commonly perceived as low and uncertain, which, to a certain extent, renders them unreliable as a basis for decisions. Thus, to be used in decision-making scenarios, tweets should meet a minimum quality level to avoid deficient decisions. The main problem in extracting opinions from social media texts is that words in natural language are highly ambiguous.

Research Hypothesis. Our hypothesis is that errors introduced into sentiment analysis (and the consequent loss of confidence in decision-making processes based on it) are primarily attributable to ambiguity in the text. We use the term “ambiguity” in its broad sense: 1) “the capability of being understood in two or more possible senses or ways” [2], which derives from linguistic features such as poorly constructed sentences or syntactical errors [3], and 2) “uncertainty” [3], which refers to the lack of semantic information and grounding between writer and reader. Thus, following the investigation in [4], ambiguity can be classified into “syntactic” and “semantic” metrics. Accordingly, our main research questions are the following:

  • How can we assess the Text Quality (TQ) of streamed tweets in real time and in batch mode?

  • What are the relevant metrics and indicators to measure in order to ensure TQ?

The aim of our research is to provide automated assistance for assessing the quality of textual data. To serve different goals in different situations, data quality must be contextualized: quality dimensions and metrics should be addressed differently in each case. We also believe that domain experts should be involved in the analysis process, which gives our proposal more flexibility for reuse in different contexts.

The research reported in this paper targets an automatic assessment of sentiment analysis text by means of a fuzzified classifier that flags ambiguous and unambiguous text at the syntactic and semantic levels. Our approach considers textual data and consists of: (i) involving domain experts in a contextual analysis by allowing them to change the weights of quality dimension metrics, (ii) evaluating tweets using text quality metrics, especially ambiguity metrics, in real time and in batch mode, and (iii) storing searches in a document-oriented database to ensure efficient information retrieval.

This paper is structured as follows. Section 2 gives an overview of sentiment analysis and related work on text quality. Section 3 presents our contribution to text quality dimensions and metrics. We present the experimental study in Sect. 4 before concluding in Sect. 5.

2 Related Work

Many issues have been highlighted in the field of DQ and TQ for machine learning applications, such as the noisy nature of input data extracted from social media sites [5] or its insufficiency [6]. Other research has mined tweets with regard to data irrelevance [5] and performed sentiment analysis in crisis times to discover the user’s feeling behind a tweet [7].

From a more comprehensive DQ point of view, [1] proposed a DQ evaluation system based on computing only higher-level DQ dimensions and metrics for data streamed in real time. [8] advanced a DQ approach based on three strategies for social media data retrieval: monitoring the crawling process, profiling social media users, and involving domain experts in the analytical process. [9] enhanced TQ through a data cleansing model for short-text topic modeling. However, most previous studies treat DQ assessment as a crisp process based on quantitative data or statistical functions, which can make quality measures difficult to interpret.

Other studies have considered that textual data cannot be processed as certain input data [10]. To handle uncertain and imprecise data, a fuzzy ontology for assessing the quality of linked data [11] and a fuzzy knowledge-based system combining an expert’s domain knowledge with existing measurement metrics [12] have been advanced.

Nevertheless, these approaches do not dive into rudimentary DQ dimensions and metrics and are closely tied to their context, making their reuse cumbersome. We think the challenge of TQ assessment lies in proposing an automatic evaluation approach with these main features: (i) adaptable and reusable according to the deployment context through expert involvement, (ii) extensible, allowing the mashup of multiple fuzzy data sources, (iii) visualizing results in real time and in batch mode, and (iv) based on a hierarchical definition of multi-level quality dimensions and metrics, explained in the following section.

3 Fuzzy Based Text Quality Assessment for Sentiment Analysis Approach

This section introduces our innovative automatic assessment approach that relies on a fuzzy tree classifier explained in Sect. 3.2 and a hierarchical definition of TQ dimensions and metrics introduced in Sect. 3.1.

3.1 Underlying Quality Model

Based on the hierarchical definition of quality and its indicators proposed in [13], we suggest enriching the definition of data quality metrics with text ambiguity metrics and context management, as shown in Fig. 1. When dealing with text quality assessment, three main levels can be identified: word, sentence, and discourse. Quality evaluation needs to be spread over these abstraction levels and to consider the decision-making context. Moreover, the hierarchical decomposition of the ambiguity concept into quantifiable indicators affecting text quality can be adapted and adjusted according to different viewpoints.

Fig. 1. Text quality dimensions

For this purpose, we had to identify the discriminating features that characterize the quality of social network text from a syntactic and semantic point of view. Table 1 gives a formal definition of the adopted syntactic and semantic ambiguity metrics. These metrics should be weighted by domain experts. We think this proposal provides: (i) flexibility, since domain experts can adapt to context variations, (ii) generality, since the metrics can cover many context-dependent cases, and (iii) richness, since more aspects can be included in each metric.
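As an illustration only (the actual metric definitions are those of Table 1, which this sketch does not reproduce), a rudimentary syntactic indicator such as an out-of-vocabulary ratio could be computed as:

```python
# Hypothetical sketch of one rudimentary syntactic ambiguity metric:
# the share of tokens in a tweet that fall outside a reference
# vocabulary (e.g. misspellings). Names and vocabulary are illustrative.
def oov_ratio(text, vocabulary):
    """Fraction of tokens not found in a reference vocabulary."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in vocabulary)
    return unknown / len(tokens)

vocab = {"the", "vaccine", "is", "safe", "and", "effective"}
# Two misspelled tokens out of six.
print(oov_ratio("The vaccine is saffe and efective", vocab))  # → 0.333...
```

A higher ratio would feed the syntactic branch of the ambiguity tree as a rudimentary-level measurement.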

Table 1. Description of ambiguity text metrics

3.2 Fuzzy Tree-Based Classifier for Text Quality Assessment

To serve different goals in different situations, data quality dimensions and metrics should be addressed differently in each case, and domain experts should be involved in the analysis process; this gives more flexibility to reuse our proposal in different contexts. Our TQ assessment approach is therefore based on computing weighted TQ metrics regulated by activation factors that take the deployment context into account.

The TQ assessment, as depicted in Fig. 2, involves a two-phase process. The first, pre-processing, phase establishes the data necessary for the quality computing phase. It sets up (1) the weight and activation settings for the TQ parameters and (2) conflict resolution when more than one expert is involved in the analytical process.

Based on this pre-established configuration, the quality assessment phase is divided into two main steps: (1) the computation of fuzzy metrics and (2) the inference over the fuzzy decision tree, detailed as follows.

Fig. 2. Text quality evaluation approach

3.2.1 Pre-processing Phase

This phase establishes the data and parameters necessary for the run-time execution of the system. In this section, we present our approach for weighting the importance of text quality indicators; the goal is to evaluate the importance of every indicator for inferring a given text quality evaluation. This phase consists of two sub-phases. The first, “Text metrics evaluation”, is based on the knowledge of domain experts; it deals with:

  • First, the establishment of the metrics’ importance weights and their relationships across the high, intermediate, and rudimentary levels. Since rudimentary metrics may not all have the same importance for an intermediate metric in a given context, a weighting coefficient reflects the relevance score of a given metric \(M_{h,i}\) with respect to the intermediate metric \(M_{h+1,i}\).

  • Second, an activation function decides whether a metric should be activated. It transforms the weighted metric into an output value to be fed to the next layer.
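The two steps above can be sketched as follows; the sigmoid shape, threshold, and steepness values are illustrative assumptions, since the paper does not fix a particular activation function:

```python
import math

# Sketch of the weighting/activation step: each rudimentary metric value
# is scaled by an expert-assigned weight, then pushed through a sigmoid
# activation before being fed to the next (intermediate) level.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def activate(metric_value, weight, threshold=0.5, steepness=10.0):
    """Weighted metric passed through a sigmoid centred on `threshold`."""
    weighted = weight * metric_value
    return sigmoid(steepness * (weighted - threshold))

# A metric of 0.8 with weight 0.9 is strongly activated (~0.9);
# a metric of 0.0 is effectively switched off.
print(round(activate(0.8, 0.9), 3))
print(round(activate(0.0, 0.9), 3))
```

The activation thus plays the role of the gate described above: weakly weighted or low-valued metrics contribute almost nothing to the next layer.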

The second sub-phase is “Conflict management”. Our approach aggregates the weights assigned by several experts. To handle imprecise and conflicting expert opinions, we apply Evidence theory (also known as Dempster-Shafer theory), a general framework for reasoning with uncertainty, with well-understood connections to other frameworks such as probability, possibility, and imprecise probability theories [14].

Given the problem of evaluating the text ambiguity associated with a given context, the universe of discourse \(\Theta \) of evidence theory is the set of all possible metrics for syntactic ambiguity evaluation (respectively, semantic ambiguity).

The syntactic frame is \(\Theta _{syn} = \{\Theta _{1}^{syn}, \Theta _{2}^{syn}, \Theta _{3}^{syn}, \Theta _{4}^{syn}, \Theta _{5}^{syn}\}\); its power set, noted \(2^{\Theta _{syn}}\), consists of all the subsets of \(\Theta _{syn}\).

The weight and activation function assigned to each metric by each expert \(E_i\) are expressed using the evidence mass function \(m_i^{syn}(x)\), also known as a basic probability assignment, such that:

$$\begin{array}{ccccc} m_i^{syn} & : & 2^{\Theta _{syn}} & \rightarrow & [0,1] \times [0,1] \end{array}$$

To access the weighting coefficient of the metric \(\theta _i\), we define the function \(per(m_i)\) where:

$$\begin{array}{ccccc} per(m_i) & : & [0,1] \times [0,1] & \rightarrow & [0,1] \\ & & (x,y) & \mapsto & x \end{array}$$

Similarly, to access the activation value of the metric \(\theta _i\), we define the function \(act(m_i)\) where:

$$\begin{array}{ccccc} act(m_i) & : & [0,1] \times [0,1] & \rightarrow & [0,1] \\ & & (x,y) & \mapsto & y \end{array}$$
$$ \left\{ \begin{array}{l} m_i^{syn}(\varnothing ) = (0,0) \\ \sum \limits _{A \in 2^{\Theta _{syn}}} per(m_i^{syn}(A)) = 1 \end{array} \right. $$

Then, each expert is objectively weighted according to the similarity of his/her opinions with the other experts’ opinions, by means of an evidence distance, as given in

$$\begin{aligned} m_{1,\ldots ,s}^{Aver}(X) = \frac{1}{s} \sum \limits _{i=1}^{s} m_i(X) \end{aligned}$$
(1)

where \(m_i(X)\) is the mass function of expert \(E_i\).

The measure of conflict between an expert \(E_j\) and the set of all other experts is:

$$\begin{aligned} conf(j,\varepsilon ) = \frac{1}{n-1} \sum \limits _{\begin{array}{c} e=1 \\ e \ne j \end{array}}^{n} conf(j,e) \end{aligned}$$
(2)
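Equations (1) and (2) can be sketched as follows. Since the pairwise conflict measure \(conf(j,e)\) is not made explicit above, an L1 distance between mass vectors is used here as an illustrative stand-in; all masses and metric names are hypothetical:

```python
# Sketch of Eq. (1): averaging the experts' mass functions,
# and Eq. (2): an expert's mean conflict with the rest of the panel.
def average_mass(masses):
    """m^Aver(X) = (1/s) * sum_i m_i(X), over all focal elements X."""
    keys = set().union(*masses)
    s = len(masses)
    return {x: sum(m.get(x, 0.0) for m in masses) / s for x in keys}

def conflict(j, masses):
    """Mean pairwise conflict of expert j against every other expert.
    Pairwise conflict is approximated here by half the L1 distance."""
    n = len(masses)
    d = lambda a, b: sum(abs(a.get(x, 0.0) - b.get(x, 0.0))
                         for x in set(a) | set(b)) / 2
    return sum(d(masses[j], masses[e]) for e in range(n) if e != j) / (n - 1)

m1 = {"misspelling": 0.6, "syntax_error": 0.4}
m2 = {"misspelling": 0.5, "syntax_error": 0.5}
m3 = {"misspelling": 0.1, "syntax_error": 0.9}   # the outlier expert
print(average_mass([m1, m2, m3]))  # → {'misspelling': 0.4, 'syntax_error': 0.6}
print(conflict(2, [m1, m2, m3]))   # → 0.45 (expert 3 disagrees most)
```

In the full approach, an expert with a high conflict score would see his/her opinion down-weighted before the final combination.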

Finally, the adjusted scores are combined to generate the weighting coefficient using Dempster’s combination rule for combining two or more belief functions [15].
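Dempster’s combination rule can be sketched as follows. For simplicity, the focal elements are assumed here to be singleton metrics, so set intersection reduces to key equality; this is an illustrative simplification, not the paper’s exact setting:

```python
# Sketch of Dempster's rule of combination for two mass functions
# whose focal elements are singletons: agreeing masses are multiplied,
# then renormalized by the non-conflicting mass (1 - K).
def dempster_combine(m1, m2):
    keys = set(m1) | set(m2)
    joint = {x: m1.get(x, 0.0) * m2.get(x, 0.0) for x in keys}
    k = 1.0 - sum(joint.values())            # K: total conflicting mass
    if k >= 1.0:
        raise ValueError("total conflict: masses cannot be combined")
    return {x: v / (1.0 - k) for x, v in joint.items()}

m1 = {"misspelling": 0.7, "syntax_error": 0.3}
m2 = {"misspelling": 0.6, "syntax_error": 0.4}
combined = dempster_combine(m1, m2)
print(combined)  # masses reinforce agreement on "misspelling"
```

Note that the rule is undefined when the two experts are in total conflict (K = 1), which is precisely why the conflict-management sub-phase precedes it.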

3.2.2 Quality Assessment Based on Fuzzy Decision Tree Inference

To assess text ambiguity, we investigate the hierarchical representation of metrics and fuzzy logic inference. We need to extend the fuzzified values of the rudimentary metrics (a subset of U) to the intermediate and high-level metrics (which are fuzzy subsets). We therefore chose the extension principle, which is in fact a special case of the compositional rule of inference.

The extension principle, described by [14], is a general method for extending crisp mathematical concepts to fuzzy quantities. It is particularly useful in connection with the computation of linguistic variables, linguistic probabilities, fuzzy-number arithmetic, etc. We apply it to deduce metric values at the higher levels of the ambiguity tree. The extension principle is defined as:

$$\begin{aligned} \mu _B(y) = \sup \limits _{x \in A} \, \min \left( \mu _\phi (x,y) , \mu _A(x) \right) \end{aligned}$$
(3)

where:

  • A is the set of syntactic ambiguity metrics \(M_1, M_2, \ldots , M_n\), and C is the set of semantic ambiguity metrics \(M_1, M_2, \ldots , M_n\), for a given level.

  • B is the set of fuzzy linguistic labels representing the ambiguity degree of a given text: B = {“Very High ambiguity”, “High ambiguity”, “Normal ambiguity”, “Low ambiguity”, “Very Low ambiguity”}.

  • \(\phi \,\text {is a function that associates x} \in \text {A to y}\in B, \; \phi (x)=y\)
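A minimal sketch of how Eq. (3) propagates rudimentary-metric memberships up to a high-level label follows; all metric names, labels, and membership values are illustrative:

```python
# Sketch of the extension principle (Eq. 3): the membership of an output
# label y is the sup over inputs x of min(phi-compatibility, mu_A(x)).
def extend(mu_A, phi, labels):
    """mu_B(y) = sup_{x in A} min(phi(x, y), mu_A(x))."""
    return {y: max(min(phi(x, y), mu_A[x]) for x in mu_A) for y in labels}

# Fuzzy memberships of two rudimentary metrics for a given text.
mu_A = {"M1": 0.8, "M2": 0.3}
# Compatibility of each metric with each high-level ambiguity label.
compat = {("M1", "High"): 0.9, ("M1", "Low"): 0.1,
          ("M2", "High"): 0.2, ("M2", "Low"): 0.8}
phi = lambda x, y: compat[(x, y)]
print(extend(mu_A, phi, ["High", "Low"]))  # → {'High': 0.8, 'Low': 0.3}
```

The sup-min composition keeps, for each label, the strongest evidence contributed by any rudimentary metric, which matches the tree-wise propagation described above.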

To explain the fuzzification step, a metric \(M_i\) and a threshold value \(M_{th,i}\) are fixed by the experts for a text T in a given context. The maximum between \((M_i - M_{th,i})\) and 0 is taken, and the resulting value is passed through a sigmoid function to compute the ambiguity level. For example, if \(M_1 = 0.3\) and \(M_{th_1} = 0.1\), the result is \(max(0.3-0.1, 0) = 0.2\); after the sigmoid function, the obtained label is \(\mu _{M1}(T) = \text {Very Low}\).
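The fuzzification step just described can be sketched as a runnable example; the sigmoid centre, steepness, and label thresholds are illustrative assumptions, since they are not specified above:

```python
import math

# Fuzzification sketch: clip (M - M_th) at zero, pass it through a
# sigmoid, then map the score to a linguistic ambiguity label.
def fuzzify(metric, threshold, steepness=10.0):
    x = max(metric - threshold, 0.0)
    score = 1.0 / (1.0 + math.exp(-steepness * (x - 0.5)))
    for label, upper in [("Very Low", 0.2), ("Low", 0.4),
                         ("Normal", 0.6), ("High", 0.8)]:
        if score <= upper:
            return label
    return "Very High"

# The worked example above: M1 = 0.3, M_th1 = 0.1 -> max(0.2, 0) = 0.2.
print(fuzzify(0.3, 0.1))  # → Very Low
print(fuzzify(0.9, 0.1))  # → Very High
```

With these assumed parameters the sketch reproduces the worked example: a clipped value of 0.2 falls on the low side of the sigmoid and is labeled “Very Low”.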

4 Experimental Study

This section presents the data collection and acquisition process and quality computing result before evaluating the quality model.

4.1 Data Collection and Acquisition

We leverage a curated dataset sourced from Kaggle [16], structured with two pivotal columns: “text”, which contains the text of the tweets, and “sentiments”, which indicates the sentiment of each tweet with values −1, 0, or 1. To enrich our data repository, we integrate the Tweepy Python library into our developed interface, allowing experts to customize the weight of each metric and to choose the subject of the scraped data, as shown in Fig. 3.

Fig. 3. Configuration interface

Fig. 4. Evaluation of text ambiguity

4.2 Quality Computing Results

The goal of the experiment is to illustrate how the quality of a Twitter stream can be assessed using the dimensions and metrics presented in Sect. 3. Sentiment analysis on the Covid-19 vaccine is taken as a case study to illustrate the usefulness of our approach.

Sentiment Analysis Models Evaluation. Table 3 presents the evaluation results of four sentiment analysis models, showing that LSTM has the best accuracy and F1-score.

Forecasting Models Evaluation. Three forecasting models were trained on this data; the evaluation results in Table 2 show that the AUTO ARIMA forecasting model has the lowest RMSE (Root Mean Square Error).

Table 2. Evaluation of forecasting models
Table 3. Evaluation of sentiment analysis models
Table 4. The effect of text quality on sentiment analysis models

4.3 Quality Model Evaluation

We evaluated the quality of 200 texts, more than 50% of which present very high ambiguity, as shown in Fig. 4. We trained three ML algorithms with and without the very highly ambiguous data. The obtained results, shown in Table 4, confirm that TQ is a necessary requirement for better results. Despite the limited quantity of texts used for training the sentiment analysis models (which accounts for the relatively low accuracy and F1-score values), removing highly ambiguous data improves the performance of the sentiment analysis models.

5 Conclusion and Future Directions

In light of the growing concern surrounding data quality in sentiment analysis for decision-making, this research presents an automatic text quality approach that can be scalable and applicable to machine learning applications within different contexts. By leveraging the principles of the data quality model, evidence theory, and fuzzy logic reasoning, we can improve the accuracy and reliability of sentiment analysis algorithms. The key contributions of this research are as follows: (1) a hierarchical decomposition of the text quality model tree to address both syntactic and semantic ambiguity, (2) contextual weighting of metrics by experts and conflict management, and (3) fuzzified quality inference by integrating weighted metrics evaluated at both low-level and high-level measurements. We believe that this proposal can be gradually enhanced by integrating additional DQ dimensions and metrics. Furthermore, the system architecture has the potential to be enriched with intelligent features and components that facilitate the derivation of contextual recommendations.