Introduction

In recent years, microblogging has grown in popularity as a means of communication and knowledge discovery, with Twitter being one of the best-known microblogging sites, with over 500 million active users [1,2,3,4,5,6,7]. Tweets, retweets, user accounts, bios, geolocations, user language, links, screen names, friend counts, and entity strings are just some of the information available on Twitter. This information is tied to users and to tweet messages, and tweets and Twitter users influence one another [8,9,10]. Each tweet has an author, a body, a unique id, the date and time it was posted and, occasionally, the user's geographic location. Most users are required to have a Twitter handle and an identifier, have followers and, in some cases, a bio. Tweets are the fundamental components of each Twitter account; Twitter also calls them "status updates." The "root-level" attributes of a tweet include the username, timestamp, and message. Identity, in this context, denotes a unique identifier. The timestamp gives the UTC time at which the tweet was sent, and the text field contains the UTF-8 text of the status update. The post is described using the metadata from the user's public Twitter account. A user can be anyone or anything, and the metadata of different users has much in common. The account identity, fixed when the account is set up, is one of the few fields that never changes.

The contents of tweets and user bios were analysed to reveal how the two influence one another in discovering high-quality topics. Tweet text, which is blended with inferred traces [11], is short, at most 140 characters (now 280 characters). Twitter bios are at most 160 characters long and provide precise information about what a user actually does, allowing Twitter users' true identities to be profiled [12]. In fewer than 160 characters, Twitter users can tell the world about themselves. This field can be filled in a variety of ways, including expressing a desire, an interest, or other personal information, as well as stating facts such as age, family status, location, or occupation. In these bio descriptions users have a few things in common, and they use specific phrases to express their role or occupation. Flattering Twitter bios highlight accomplishments and characterise their owners as legitimate and human; a bio may also include an invitation for others to join the user. Like tweets, bios reveal a user's interests and affiliations [13, 14], which makes them worth examining further.

According to prior studies, topic modelling is the best technique for analysing short texts [15,16,17,18,19,20,21,22,23,24,25,26]. In this study, Twitter topic models such as LDA and CTM were applied to Twitter users' bios and tweet data, with 10, 20, and 30 topics for each model, using hashtag and user pooling schemes to pool topically equivalent tweets into documents that were not only convenient but also topically coherent, so that better topics could be discovered. To evaluate and compare the topics produced by topic modelling, hierarchical clustering with the Ward-D2 and Euclidean distance methods was computed on the LDA and CTM topic models, producing a collection of dendrograms for the bios and tweet texts. The entanglement value was calculated after visual comparison of the dendrograms, and correlation matrices between the dendrograms were also computed. The findings showed that both the user bio and the tweet text had an impact on topic quality discovery.

The most significant contributions of this research are:

  • Twitter topic models, i.e., LDA and CTM, have been implemented with Twitter users' bios and tweet data for each model with 10, 20, and 30 topics.

  • On each model, the topics are discovered, and dendrograms are built to evaluate them.

  • On dendrograms, the various metrics were analyzed to show that both the user bio and the tweet content had an impact on topic quality discovery.

  • On both tweets and bios, the performance of topic models was investigated.

Related Work

Semertzidis investigated how Twitter users express themselves in their profile bios and discovered that analysis of user bios can predict the links between Twitter users [27]. Wagner investigated several types of user-related information, such as tweets, retweets, and user-list memberships, to gain a better understanding of Twitter users' expertise [28]. These studies use quantitative methods to characterise Twitter users by assessing user-related information such as tweets, retweets, and biographies; however, qualitative techniques are not explored. The proposed study utilises both qualitative and quantitative methodologies to achieve better outcomes.

Rodriguez proposed a fuzzy logic-based followee recommendation mechanism on Twitter. This system approaches recommendation as a link prediction problem and makes use of three types of resemblance between two users: similarity of tweets, similarity of followee ids, and similarity of followee tweets. These similarities are computed during the extraction of user profiles [29]. Tran developed a hashtag recommendation approach based on a study of tweet content, user attributes, and very popular Twitter hashtags, which dramatically enhances the effectiveness of hashtag recommendation systems [30]. Corcoglioniti proposed a recommendation system that uses attributes ranging from core social media features to specialised domain-relevant user profile traits inferred from knowledge using machine learning methods [31]. The approaches used in these studies investigate only Twitter user profiles or only tweets; they do not evaluate tweet content and user profiles in tandem, and hence miss discovering appropriate topics. In the present work, the influence of user profiles on tweet content has been assessed in order to uncover high-quality topics.

Ding investigated methods to obtain new knowledge about people's preferences from Twitter bios. A noisy training sample set was created automatically using a series of seed sequences and intuitive criteria, and a CRF model was then trained on this labelled data set. They also investigated the relationship between interest tags derived from user bios and tweet text, using tf-idf frequencies of lexical items as features, and discovered a weak association between them, indicating that bios might serve as a complementary source of data to tweets. In contrast, the proposed work shows that tweet content and user profiles have a strong connection for discovering quality topics [32].

Jones presented a publicly accessible pooled database for the examination of temporal changes in individually articulated identity; the collection permits evaluation of the prevalence of phrases that Twitter users in the United States of America prefer to include in their bios. They employed a longitudinal online profile sampling approach for examining people's preferences when identifying themselves using terms [33]. Rogers examined Twitter bios and found that the average American user is progressively incorporating politics into their social identity; they also employed longitudinal online profile sampling to provide measurable insights into how people change their identities over time [34]. Pathak utilized phenomenological strategies to discover and describe the idea of a distinctive identifier, which is essentially indicative of how identity is conveyed in Twitter bios. In addition, a strategy for extracting all individual identifiers contained in a given bio was developed, and the researchers assessed the dependability, authenticity, and usefulness of employing terms derived from Twitter bios to explore socioeconomic ethnicity [35]. These recent studies used longitudinal online profile sampling approaches to analyze Twitter users' biographies and tweet content for public social identification; however, they did not use any unsupervised learning methodologies or validation methods to discover quality content, which the proposed method does.

Materials and Methods

Figure 1 depicts the conceptual model of this research, which includes data preparation, data pre-processing, specification and selection of the topic model, and visualization of the topic models, as well as the construction of dendrograms, performance investigation of the topic models, and applications. The following sections discuss each phase.

Fig. 1 The conceptual model of this research

Dataset

The dataset #App (tweets and bios) has been gathered based on hashtag and user pooling schemes with the support of the Twitter Streaming APIs and an R library. The Twitter Streaming APIs gave access to tweets in real time as they were published on Twitter. The R library supported the connection to the Twitter Streaming APIs and extracted tweets matching a specific search term (hashtag, #), language, duration, etc. A hashtag (#) is a Twitter convention used to streamline search, indexing, and trend discovery; users can incorporate specially designated terms that begin with # into the body of any post. For the proposed work, the dataset has been collected with the hashtag #App and language English over the period 29 June 2019 to 22 July 2019, during which 106,868 tweets were gathered from Twitter. After the tweets were gathered by hashtag, ten thousand users were sampled using R libraries, from which 4,000 user bios were collected.
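
The paper names only the Twitter Streaming APIs and an R library, so as an illustration, a minimal collection sketch using the rtweet package (an assumption; the exact package is not stated) might look as follows:

```r
# Sketch of the data-collection step, assuming the rtweet package and
# a pre-configured Twitter app token for authentication.
library(rtweet)

# Stream tweets containing the hashtag #App; timeout is in seconds.
raw <- stream_tweets(q = "#App", timeout = 60 * 60, parse = TRUE)

# Keep English tweets only, matching the paper's language filter.
tweets <- raw[raw$lang == "en", ]

# Collect bios (the 'description' field) for a sample of posting users.
users <- lookup_users(unique(tweets$user_id)[1:10000])
bios  <- users$description[nzchar(users$description)]
```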

Data Pre-Processing

Data pre-processing enabled the production of high-quality text for categorization while also reducing computational complexity. The following pre-processing steps were applied to the dataset: removing the "RT" (retweet) marker, links, hashtags, punctuation, extra white space, and stop words, and converting the text to lower case. Stop words in this context are words that serve a connecting function in the sentence, such as articles, prepositions, and other function words like "the," "is," "at," "which," and "on." After pre-processing, the document-term matrix (DTM) was generated to build the topic models.
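
A minimal sketch of this pipeline with the R tm package (assumed here; the paper does not name the text-mining library) is shown below; the regular expressions for retweet markers, links, and hashtags are illustrative assumptions:

```r
# Pre-processing sketch: clean the tweet text and build the DTM.
library(tm)

corpus <- VCorpus(VectorSource(tweets$text))

strip  <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, strip, "RT\\s*@\\w+")   # retweet markers
corpus <- tm_map(corpus, strip, "http\\S+")      # links
corpus <- tm_map(corpus, strip, "#\\w+")         # hashtags
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Document-term matrix used to fit the topic models.
dtm <- DocumentTermMatrix(corpus)
```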

Topic Models

The objective of topic modeling, as described in the introduction, is to automatically find the topics in a set of documents. The arrangement of topics per document and terms per topic constitutes the hidden topic structure. A document often covers a number of topics in varying degrees of depth, and topic modeling approaches establish groups of related terms as topics [36,37,38]. A topic model encapsulates this intuition in a statistical model that enables the assessment of a collection of documents and the discovery of the topics from the statistics of the terms in each, as well as the proportion of topics in each document. The central computational challenge for topic modeling is inferring the hidden topic structure from the observed texts. In this work, the topic models LDA and CTM have been implemented on both datasets, i.e., Tweets and Bios.

Latent Dirichlet Allocation (LDA)

Latent Dirichlet allocation (LDA) is a topic model belonging to the larger field of probabilistic modeling. LDA can be described more formally with the following notation [15]. The topics are β1:T, where every βt is a distribution over the vocabulary. θs denotes the topic proportions of the sth document, in which θs,t is the topic proportion for topic t in document s. cs denotes the topic assignments of the sth document, in which cs,m is the topic assignment for the mth word in document s.

Finally, os are the observed words of document s, wherein os,m is the mth word in document s, which is an element of the fixed vocabulary. With this notation, the LDA generative process corresponds to the following joint distribution of the hidden and observed variables (Eq. 1),

$$p\left( {\beta_{1:T} ,\theta_{1:S} ,c_{1:S} ,o_{1:S} } \right) = \mathop \prod \limits_{i = 1}^{T} p(\beta_{i} )\mathop \prod \limits_{s = 1}^{S} p(\theta_{s} )\left( {\mathop \prod \limits_{m = 1}^{M_{s} } p(c_{s,m} |\theta_{s} )p(o_{s,m} |\beta_{1:T} ,c_{s,m} )} \right)$$
(1)

It should be kept in mind that this distribution specifies a number of dependencies. For example, the topic assignment cs,m depends on the per-document topic proportions θs, and the observed word os,m depends on the topic assignment cs,m and on all the topics β1:T.

The number of topics must be fixed in advance when fitting the LDA model to a given document-term matrix. Furthermore, estimation using Gibbs sampling requires values for the parameters of the prior distributions: α = 10/k for the per-document topic distributions θ, each drawn from a Dirichlet prior with parameter α, and 0.1 for the per-topic term distributions ϕ, each drawn from a Dirichlet prior with parameter β.

Turning to the computational problem, the conditional distribution of the topic structure given the observed documents must be computed. This is called the posterior. In this notation, the posterior is (Eq. 2),

$$p\left( {\beta_{1:T} ,\theta_{1:S} ,c_{1:S} |o_{1:S} } \right) = \frac{{p\left( {\beta_{1:T} ,\theta_{1:S} ,c_{1:S} ,o_{1:S} } \right)}}{{p\left( {o_{1:S} } \right)}}$$
(2)

In the equation above, the numerator is the joint distribution of all the random variables, including the hidden ones, and the denominator is the marginal probability of the observations under the implemented topic model.

Correlated Topic Models (CTM)

Topic models characterize the distributions of topics and terms for given documents. The topic proportions for a document s form a vector θs over all possible topics T. The correlated topic model makes it possible to capture correlation between topics by assuming a multivariate distribution for a transformed version of θs. The data come as word occurrences, i.e., the number of times each word from a given vocabulary appears in a given document is what matters to the topic model [39]. It is assumed that S documents, each having Ms words, are created from a vocabulary of J words. The mth word in document s is denoted os,m ∈ {1,…,J}, s = 1,…,S, m = 1,…,Ms. Given the topic attributed to that word, os,m follows a multinomial distribution over the vocabulary of J words. The topic assignment itself, cs,m, follows a multinomial distribution over the possible T topics. The probability that a term appears in a document is thus effectively determined by the term occurrences, while the topic is a hidden variable to be estimated.

In the proposed work, an LDA model is fitted for demonstration with 10 topics, using Gibbs sampling with appropriate values set for the control parameters: the initial (burn-in) iterations are set to zero and then every 1000th iteration is returned over 1000 cycles. The prior α has been set to 10/k for the number of topics k = 10, 20, and 30, and the best model with respect to the log-likelihood log(p(o|c)) observed across the returned Gibbs samples is retained.
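
A sketch of such a fit with the R topicmodels package is given below; the package and the exact burn-in/thinning values are assumptions based on one reading of the settings above:

```r
# LDA fit via Gibbs sampling (topicmodels package, assumed).
library(topicmodels)

k <- 10
lda_gibbs <- LDA(dtm, k = k, method = "Gibbs",
                 control = list(alpha  = 10 / k,  # Dirichlet prior on theta
                                delta  = 0.1,     # Dirichlet prior on phi
                                burnin = 0,       # "initial iteration ... zero"
                                iter   = 1000,
                                thin   = 1000,    # return every 1000th draw
                                best   = TRUE))   # keep highest log-likelihood
logLik(lda_gibbs)
```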

The CTM is fitted for demonstration with 10 topics, using variational inference and the Expectation–Maximization (EM) algorithm, with the maximum number of iterations set to 500 and the tolerance for the relative change in the likelihood set to 0.0001, for both the variational inference step and the EM algorithm. A second configuration, with these parameters set to 1000 and 0.001, respectively, was also tried to optimize the fit to the observed data. The resulting approximate mean entropy is given below.
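
Under the same assumption of the topicmodels package, the CTM fit with the first configuration above could be specified as:

```r
# CTM fit via variational EM; iteration caps and tolerances follow
# the first configuration described in the text.
ctm_vem <- CTM(dtm, k = 10,
               control = list(var = list(iter.max = 500, tol = 1e-4),
                              em  = list(iter.max = 500, tol = 1e-4)))
```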

Tables 1 and 2 show the mean entropy for 10, 20, and 30 topics rendered by the LDA and CTM topic models on the datasets #App (tweets) and #App (bios), respectively. They show that mean entropy is directly proportional to the number of topics, and that a greater value for a fitted topic model suggests that its distribution of topics is responsive to the dataset.

Table 1 Mean entropy of 10, 20, and 30 topics from the fitted topic models LDA and CTM on Tweets dataset
Table 2 Mean entropy of 10, 20, and 30 topics from the fitted topic models LDA and CTM on Bios dataset
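
The mean entropy of the per-document topic distributions can be computed from a fitted model as in the following sketch (which follows the topicmodels vignette; the model object names come from the sketches above):

```r
# Higher mean entropy => topics are spread more evenly over documents.
topic_dist   <- posterior(lda_gibbs)$topics   # documents x topics matrix
mean_entropy <- mean(apply(topic_dist, 1, function(z) -sum(z * log(z))))
mean_entropy
```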

Tables 3 and 4 show sample topics rendered by the LDA implementation with 10 topics on the datasets #App (tweets) and #App (bios), respectively. From Table 3, some of topic 1's terms include "dayforce," "reuters," and "hotschedules," which suggest distinct apps, while some of topic 6's terms include "moovit," "vyclone," and "careerbuilder," which suggest another set of apps. From Table 4, some of topic 3's terms include "hacker," "testers," and "developer," which suggest an identity, while topic 9's terms include "musician," "traveling," "evofuse," and "networking," which suggest another set of identities.

Table 3 Sample topics from the fitted topic model LDA on Tweets dataset
Table 4 Sample topics from the fitted topic model LDA on Bios dataset
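
Term lists such as these can be extracted directly from a fitted model; for example, assuming the lda_gibbs object from the earlier sketch:

```r
# The ten most probable terms for each of the k topics,
# as used to populate tables like Tables 3 and 4.
terms(lda_gibbs, 10)
```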

Tables 5 and 6 depict the topics rendered by the CTM implementation with 10 topics on the datasets #App (tweets) and #App (bios), respectively. From Table 5, some of topic 1's terms include "ziprecruiter," "mobileiron," and "jobseeker," which suggest distinct apps, while some of topic 9's terms include "shopping," "vyclone," and "kronos," which suggest another set of apps. From Table 6, some of topic 6's terms include "developer," "calendars," and "icbc," which suggest an identity, while topic 9's terms include "musician," "appanalytics," and "hacker," which suggest another set of identities.

Table 5 Sample topics from the fitted topic model CTM on Tweets dataset
Table 6 Sample topics from the fitted topic model CTM on Bios dataset

Results and Discussion

The topics rendered by the topic models LDA and CTM are reviewed and contrasted in the proposed work by creating dendrograms. Dendrograms are a form of "bottom-up" agglomerative hierarchical clustering: at first, each object (topic) forms its own single-item cluster (leaf), and at every stage of the algorithm the two most similar clusters are merged into a new, larger cluster (node).

The similarity between the objects was determined using the Euclidean distance method, the result of which is a distance or dissimilarity matrix. The linkage function then uses the distance information to group objects into clusters based on their resemblance. Newly formed clusters are then linked to one another to form larger clusters, and this process is repeated until a hierarchical tree containing all of the objects in the original dataset is formed. Various cluster agglomeration methodologies (i.e., linkage strategies) exist.

The most often used linkage methods are maximum or complete linkage, minimum or single linkage, mean or average linkage, centroid linkage, and Ward's minimum variance method. In this study, Ward's method together with the Euclidean distance method is adopted. In the dendrograms below, each leaf represents a single object. The (dis)similarity or distance between two objects/clusters is shown by the vertical height of their merge: the higher the merge, the less similar the objects are. This height is defined as the cophenetic distance between the two objects.
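
A minimal sketch of how such a dendrogram might be built in R is shown below, assuming each topic is represented by its term distribution from the fitted model (an assumed representation; the paper does not state it explicitly):

```r
# Cluster the k topics on their per-topic term distributions.
beta <- posterior(lda_gibbs)$terms        # topics x terms matrix
d    <- dist(beta, method = "euclidean")  # dissimilarity matrix
hc   <- hclust(d, method = "ward.D2")     # Ward-D2 linkage
dend_tweets <- as.dendrogram(hc)          # repeat on the Bios fit for dend_bios
plot(dend_tweets)
```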

After linking the objects of a dataset into a hierarchical cluster tree, the distances (i.e., heights) within the tree should be calculated and should portray the original distances faithfully. To see how well the cluster tree performs, the cophenetic distances are compared with the original distance data obtained with the distance method. If the clustering is valid, the linking of objects in the cluster tree correlates strongly with the distances in the initial distance matrix. A correlation coefficient close to 1 is ideal; the value found here is 0.75, which is considered acceptable.
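
In R, this check amounts to correlating the original distance matrix with the cophenetic distances of the tree, for example:

```r
# Validity check: compare original distances with tree (cophenetic) distances.
coph <- cophenetic(hc)   # heights at which pairs of topics first merge
cor(d, coph)             # close to 1 is ideal; ~0.75 is considered acceptable
```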

For each of the #App (tweets) and #App (bios) datasets, the process began with the generation of two dendrograms, using the Ward-D2 method of hierarchical clustering on the LDA and CTM topic models. Figures 2, 3, 4, and 5 show them individually.

Fig. 2 Dendrogram of 10 topics from the fitted topic model LDA on the Tweets dataset

Fig. 3 Dendrogram of 10 topics from the fitted topic model LDA on the Bios dataset

Fig. 4 Dendrogram of 10 topics from the fitted topic model CTM on the Tweets dataset

Fig. 5 Dendrogram of 10 topics from the fitted topic model CTM on the Bios dataset

Figure 2 shows that the topics are grouped into 3 clusters: topics 8, 9, and 10 form cluster 1, topic 2 forms cluster 2, and topics 1, 3, 4, 5, 6, and 7 form cluster 3. Among the 3 clusters, topics 5, 6, 7, 8, 9, and 10 are almost the same distance apart. Figure 3 shows that the topics are likewise grouped into 3 clusters: topics 8, 9, and 10 form cluster 1, topic 2 forms cluster 2, and topics 1, 3, 4, 5, 6, and 7 form cluster 3, with topics 5, 6, 7, 8, 9, and 10 again almost the same distance apart. Figures 2 and 3 thus show that the cluster dendrogram of 10 LDA topics on dataset #App (tweets) and the cluster dendrogram of 10 LDA topics on dataset #App (bios) are almost identical.

Figure 4 shows that the topics are grouped into 3 clusters: topics 1, 3, 6, 7, 8, and 9 form cluster 1, topic 5 forms cluster 2, and topics 2, 4, and 10 form cluster 3. Among the 3 clusters, topics 3, 4, 9, and 10 are almost the same distance apart. Figure 5 shows that the topics are grouped into 3 clusters: topics 3, 4, and 9 form cluster 1, topics 2, 6, and 8 form cluster 2, and topics 1, 5, 7, and 10 form cluster 3. Among the 3 clusters, topics 3, 6, 7, 8, 9, and 10 are almost the same distance apart. From Figs. 4 and 5, it can be seen that the cluster dendrogram of 10 CTM topics on dataset #App (tweets) and the cluster dendrogram of 10 CTM topics on dataset #App (bios) are distinct. After computing the hierarchical clustering, the two dendrograms are compared visually and a correlation matrix between the dendrograms is computed.

In an experiment, the two dendrograms are visually compared by plotting them beside each other, with corresponding labels connected by arcs. The quality of the alignment or separation is measured by the entanglement value between the trees. Entanglement is calculated by assigning values from 1 to the tree length to the left tree's labels and then matching these values to the right tree; it is defined as the L-norm distance between these two vectors, where L indicates the penalty level (for acute angles, a large penalty is chosen). It is evaluated between 1 (complete entanglement) and 0 (no entanglement).
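
This comparison corresponds to the tanglegram and entanglement functions of the R dendextend package (assumed here; the paper does not name the package), applied to the two topic dendrograms:

```r
# Tanglegram and entanglement between the Tweets and Bios dendrograms.
library(dendextend)

dends <- dendlist(dend_tweets, dend_bios)
tanglegram(dends)            # side-by-side plot with arcs between labels
entanglement(dends, L = 2)   # 0 = no entanglement, 1 = full entanglement
```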

A lower entanglement coefficient indicates an effective alignment. In the diagrams, the highlighted lines indicate unique nodes. Figures 6 and 7 show that the entanglement coefficient of the cluster dendrograms of dataset #App (tweets and bios) on 10 topics of the LDA topic model is 0.02, versus 0.64 for the cluster dendrograms of dataset #App (tweets and bios) on 10 topics of the CTM topic model, i.e., LDA achieves 98% and CTM 36% alignment. This indicates that LDA has outperformed CTM and supports the claim that tweets and bios mutually influence the discovery of topics. The entanglement coefficient has also been calculated for 20 and 30 topics, as shown in Table 7. As a consequence, it was recognized that LDA performs better for 10, 20, and 30 topics. The same can be interpreted from the outcomes shown in Figs. 2 and 3, respectively.

Fig. 6 Comparison of cluster dendrograms of 10 topics from the fitted topic model LDA on the Tweets and Bios datasets

Fig. 7 Comparison of cluster dendrograms of 10 topics from the fitted topic model CTM on the Tweets and Bios datasets

Table 7 Entanglement co-efficient of dendrograms of topics 10, 20, and 30 built from the fitted topic models LDA and CTM on the Tweets and Bios datasets

Next, the cophenetic correlation matrix between the cluster dendrograms of the #App (tweets and bios) dataset on 10 topics of the LDA topic model, and likewise for the CTM topic model, has been computed. The cophenetic correlation coefficient describes the linear correlation between the original pairwise distances and the associated cophenetic distances, where the cophenetic distance between two objects is the height at which they are first merged into the same cluster. This approach was used to determine whether a dendrogram is adequate: a strong correlation (closer to 1) between the initial and cophenetic distances indicates a good fit for the dendrogram. The resulting values are shown in Figs. 8 and 9, respectively. The cophenetic correlation coefficient for 20 and 30 topics has also been calculated, as shown in Table 8. As a result, it was understood that LDA performs better for 10, 20, and 30 topics, as its coefficient of cophenetic correlation is close to 1 for both the tweets and bios dendrograms, indicating that tweets and bios are mutually influential in the discovery of topics.
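
With dendextend, this between-tree measure is available directly, for example:

```r
# Correlation between the cophenetic distance matrices of the two trees;
# values near 1 indicate structurally similar dendrograms.
cor_cophenetic(dend_tweets, dend_bios)
```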

Fig. 8 Cophenetic correlation coefficient between dendrograms of the Tweets and Bios datasets on 10 topics of the LDA topic model

Fig. 9 Cophenetic correlation coefficient between dendrograms of the Tweets and Bios datasets on 10 topics of the CTM topic model

Table 8 Cophenetic correlation co-efficient of dendrograms of topics 10, 20, and 30 built from the fitted topic models LDA and CTM on the Tweets and Bios datasets

Finally, the proposed technique was contrasted with another method, namely perplexity, which has been the most common method for assessing topic models [40, 41]. In the case of LDA, perplexity is not very satisfactory because it is hard to interpret: LDA works well when the topics are softly clustered, and perplexity can look good when learning proceeds smoothly while the overall model quality remains very poor. It is therefore evident that the suggested technique is preferable to perplexity.

Furthermore, since LDA, as a generative probabilistic model, has no implicit clustering properties, hierarchical clustering methods have been explicitly applied, resulting in agglomerative clusters of topics that can be evaluated in a distinctive way in this work by comparing the topics discovered from different corpora such as tweets and bios.

The limitations of this research are that the quantity of user bios and tweets is small and that only a single dataset was used. In addition, only three metrics with two techniques were used to assess performance. All of these will be expanded in future work.

Conclusion

Twitter topic models, i.e., LDA and CTM, have been deployed with Twitter user bios and tweet data for each model with 10, 20, and 30 topics. A set of dendrograms for bios and tweets text has been created in order to analyse and compare the topics that have been rendered using topic modelling. The entanglement value was determined by comparing the dendrograms visually. The cophenetic correlation coefficient between dendrograms has also been computed. The findings revealed that both the user bio and the tweet text had an impact on topic quality discovery. Based on the following metrics, the LDA topic model outperforms the CTM topic model.

  1. The mean entropy of 10, 20, and 30 topics produced by the LDA and CTM topic models on the datasets #App (tweets) and #App (bios) has been computed (Tables 1 and 2). It demonstrates that mean entropy is directly proportional to the number of topics, and the higher values for the fitted LDA model suggest that its distribution of topics is responsive to the dataset.

  2. The entanglement coefficient for 10, 20, and 30 topics, as shown in Table 7, was determined to assess the quality of the alignment or separation among the dendrograms of dataset #App (tweets and bios) on the LDA and CTM topic models. As an outcome, it was determined that LDA performs better for 10, 20, and 30 topics.

  3. The cophenetic correlation coefficient was then determined for 10, 20, and 30 topics, as shown in Table 8. As a consequence, it was revealed that LDA performs better for 10, 20, and 30 topics, since the coefficient of cophenetic correlation for both the tweets and bios dendrograms is nearer to one.

In future work, the proposed method is intended to be extended to additional Twitter topic models that incorporate more information, such as follower profiles and retweets, for social identity and recommendation applications.