
1 Introduction

Nowadays, social networks have become an essential part of human life. According to recent research by Statista, more than 3.5 billion people worldwide had at least one social platform account in 2019 [1]. The rapid growth of users brings an enormous amount of data, which creates many opportunities for those who can discover patterns in user data and put them to meaningful use.

In Vietnam, Facebook is the social network with the largest number of users [1]. Posts on Facebook come from individuals (particularly famous figures) or from organizations, in the form of what is called a fan page. Because fan pages can share posts directly with their fans (i.e., the people who follow them), they play an important role in spreading information, news, and facts on Facebook. If we can model the topics of posts on popular pages, we have a good chance of identifying trending content on Facebook.

In recent years, there has been much research on using Latent Dirichlet Allocation (LDA) to cluster scientific documents [2, 3] and news articles [4, 5]. For social network document analysis, there have been studies on topic modeling on Twitter [6, 7] and on favorite topics in Facebook posts [8]. This research focuses on modeling Facebook fan pages by applying topic modeling to their documents (i.e., the fan pages' posts).

In this paper, we propose a solution that models the topics of documents with LDA and combines them with the interaction indices of the corresponding Facebook posts to obtain an effective vector representation of Facebook fan pages. We then apply this representation to analyze the topic distribution of each fan page and to find groups of similar fan pages. The proposed solution clusters fan pages into subsets effectively, yielding better clustering performance than modeling with LDA alone. The fan page representation also helps point out similarities between fan pages and gives an idea of what is happening on Facebook in a particular period of time.

This paper is organized as follows. Section 2 reviews past studies leading to the motivation of our work. Section 3 describes our proposed solution. Experiments and results are given in Sect. 4. Section 5 presents the conclusion and future work.

2 Related Work and Motivation

Topic modeling using LDA is not a new technique in Natural Language Processing. LDA is an unsupervised learning model, which makes it a good technique for document classification, especially on unlabeled datasets such as the textual data of social networks. Several studies have taken advantage of this to model and analyze Twitter conversations [6] or the favorite topics of young Thai Facebook users [8]. The main focus of these studies is modeling and mining the topics in the text corpus of social network users. Their proposed methods can reveal the topics in which a particular group of users is interested, for example educational workers and students at the National University of Colombia [6] or students at Assumption University in Thailand [8]. However, when we wish to discover the major content of a social network on a large scale, such as finding trending topics among the users of a nation, the available approaches show their limitations: it is almost impossible to collect the published data (i.e., the posts) of every person on the social network, both because doing so goes against privacy rights and because collecting and processing the data would take considerable time and resources.

This paper tackles this problem by focusing on a class of special users that have much more impact on social networks than other individuals, namely key opinion leaders (KOLs) and popular organizations. A post by a KOL or a well-known organization, usually on their fan page, may lead the opinions, represent the thoughts, and attract the interest of the many people who follow them on social networks. Therefore, instead of collecting data from every regular account on a social network, we only need to collect and analyze data from a number of influential accounts of KOLs and organizations, thereby achieving equivalent effectiveness in capturing the trends of the social network at large scale. In this paper, we selected the most reputable Facebook fan pages in Vietnam for topic mining and other data analyses.

3 A Novel Solution for Facebook Fan Page Modeling

3.1 Observations

To know what a Facebook fan page is talking about, we have to analyze the contents of its posts. In this research, we are only interested in the textual part (called the document hereinafter) and the interactional part (i.e., reactions and comments) of a post. If we can extract the topics of every document in the text corpus of a fan page, we can probably figure out the most popular topics of the page.

We assume that each document in a fan page's corpus has its own topic probability distribution or, in other words, that each document can be represented by a fixed-dimensional vector depending on what its content is about. For example, if a document has topic proportions of 30% sport, 50% technology, and 20% politics, its topic distribution vector will be [0.3; 0.5; 0.2]. In practice, such vectors are not directly observable as in this example but are hidden in the textual data. We need a solution that combines the topic distribution vectors of the documents in a fan page's corpus to find the topic distribution vector representing the fan page.

After studying the properties of Facebook data, we realized that different posts (and thus their corresponding documents) have different degrees of importance for the topic proportions of a fan page. Posts that receive more interactions from users are likely to contribute more to the composition of a fan page's topics and to the distinction among fan pages.

3.2 Proposed Process Diagram

Figure 1 presents the proposed process flow. First, the raw data of fan pages is collected by crawlers, from which textual data and interactional data are extracted. The textual data is then pre-processed by removing page signatures, special characters, icons, and stop words. LDA-based topic modeling is applied to the pre-processed documents, returning the topic distribution vectors of all documents in each fan page's corpus. Meanwhile, the interactional data is used to calculate the interaction indices of all documents of each fan page. Finally, the vector representation of every fan page is obtained by combining the topic distribution vectors and interaction indices of all documents of the page. How this combination is done is described in detail in Sect. 3.4.

Fig. 1. Process diagram of the proposed solution.

3.3 Topic Modeling Using LDA

LDA is a method widely employed for modeling the topics of documents in a corpus, proposed by Blei et al. in 2003 [9]. This method assumes that each document in the corpus is a probability distribution over topics and each topic is a probability distribution over the words in the vocabulary of the corpus. Given a corpus D, LDA assumes that the corpus can be generated by the following process [10] (Fig. 2):

Fig. 2. Graphical representation of the LDA model (modified from [9]).

  • Step 1. For each topic k of the K topics, draw a distribution over the words in the vocabulary:

    $$ \varphi^{(k)} \sim Dirichlet\left( \beta \right) $$

  • Step 2. For each document \( d \in D \):

    a) Draw a distribution over the topics of the document:

    $$ \theta_{d} \sim Dirichlet\left( \alpha \right) $$

    b) For each word \( w_{i} \in d \):

    i. Draw a topic assignment: \( z_{i} \sim Discrete\left( \theta_{d} \right) \)

    ii. Choose the word: \( w_{i} \sim Discrete\left( \varphi^{(z_{i})} \right) \)

where K is the number of latent topics in the corpus and \( \alpha, \beta \) are the parameters of the corresponding Dirichlet distributions.
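To make the generative process concrete, the following toy simulation samples a document from it; this is an illustrative sketch only, and the vocabulary size, number of topics, and document length are arbitrary assumptions:

```python
# Toy simulation of the LDA generative process (illustrative sketch;
# K, V, and the document length are arbitrary assumptions).
import numpy as np

rng = np.random.default_rng(0)
K, V = 3, 10              # number of topics, vocabulary size
alpha, beta = 0.05, 0.05  # Dirichlet parameters

# Step 1: per-topic word distributions phi^(k) ~ Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)

def generate_document(length):
    # Step 2a: per-document topic distribution theta_d ~ Dirichlet(alpha)
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(length):
        z = rng.choice(K, p=theta)             # Step 2b-i: topic assignment
        words.append(rng.choice(V, p=phi[z]))  # Step 2b-ii: word draw
    return words

doc = generate_document(20)  # a document of 20 word indices
```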

The above process results in the following joint distribution [10]:

$$ p\left( {{\mathbf{w}},{\mathbf{z}},\theta ,\varphi |\alpha ,\beta } \right) = p\left( {\varphi |\beta } \right)p\left( {\theta |\alpha } \right)p\left( {{\mathbf{z}}|\theta } \right)p\left( {{\mathbf{w}}|\varphi_{z} } \right) $$
(1)

where w denotes the words of the corpus and z is the topic assignment for each word in w.

3.4 Fan Page Representation Using Vectors

Figure 3 illustrates the process of obtaining the topic distribution vector of a particular document in a fan page's text corpus by using LDA. The LDA model gives us two outputs: the cluster of words for each topic and the topic assignment for each word in the corpus. Therefore, we know exactly how many times a topic appears in a document, or in other words how many times the topic has been assigned to any word in the document, simply by counting. We then get the topic distribution vector of the document by calculating the probability of each topic being assigned to a word in that document. From these document vectors, we can then derive the topic distribution vector of each fan page, as described below.

Fig. 3. Process to obtain the topic distribution of a document by using LDA.
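As a minimal sketch of this counting step, assume we already have the per-word topic assignments \( z_{i} \) of a document (the values below are hypothetical); normalizing the topic counts yields the document's topic distribution vector:

```python
# Turn per-word topic assignments into a document's topic distribution
# vector by counting and normalizing (assignments are hypothetical).
import numpy as np

K = 20                                          # number of topics (as set later in Sect. 4.2)
word_topic_assignments = [2, 2, 10, 13, 2, 10]  # z_i for each word w_i in d

counts = np.bincount(word_topic_assignments, minlength=K)
v = counts / counts.sum()  # topic distribution vector of the document
```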

We propose a simple way to calculate the topic distribution vector of each fan page by summing the topic distribution vectors of all documents in its corpus. However, each document in the sum should carry a weight reflecting how interactive its corresponding post is, as discussed in Sect. 3.1. Therefore, we additionally propose to use the number of reactions (e.g., like, haha, angry, etc.) and the number of comments on each post as parameters to compute the weight of that post, and thus of its document, in forming the topic distribution of a fan page.

Let \( V = \left\{ {v_{1} ;v_{2} ; \ldots ;v_{n} } \right\} \) be the set of topic distribution vectors of the documents in a fan page's corpus, where n is the number of documents in the corpus, and let \( t_{i} \), \( r_{i} \), \( c_{i} \) be respectively the interaction index, the number of reactions, and the number of comments of the ith document (\( 1 \le i \le n \)). The interaction index of the ith document is calculated as

$$ t_{i} = \eta r_{i} + \mu c_{i} , $$
(2)

where \( \eta \) and \( \mu \) represent the relative importance of reactions and comments, respectively, in the interaction index. Since comments are considered more valuable than reactions in terms of the degree of interaction, we experimentally set \( \eta = 1 \) and \( \mu = 3 \).

Let P be the topic distribution vector representing a fan page. P can be calculated as the weighted sum of the topic distribution vectors of all documents of the page, i.e.,

$$ P = w_{1} v_{1} + w_{2} v_{2} + \ldots + w_{i} v_{i} + \ldots + w_{n} v_{n} , $$
(3)

where the weight of each document is its interaction index normalized over all documents of the fan page, i.e.,

$$ w_{i} = \frac{{t_{i} }}{{\sum\limits_{j = 1}^{n} {t_{j} } }}. $$
(4)

Finally, we can rewrite the vector representation of the page as:

$$ P = \frac{{\sum\limits_{i = 1}^{n} {(t_{i} \times v_{i} )} }}{{\sum\limits_{j = 1}^{n} {t_{j} } }}. $$
(5)
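As a sketch, Eqs. (2) and (5) translate directly into code; the document vectors and interaction counts below are toy values for illustration:

```python
# Interaction-weighted average of document topic vectors (Eqs. (2) and (5)).
import numpy as np

eta, mu = 1, 3  # relative importance of reactions vs. comments (Sect. 3.4)

def page_vector(doc_vectors, reactions, comments):
    """doc_vectors: (n, K) array of v_i; reactions, comments: length-n arrays."""
    t = eta * np.asarray(reactions) + mu * np.asarray(comments)          # Eq. (2)
    return (t[:, None] * np.asarray(doc_vectors)).sum(axis=0) / t.sum()  # Eq. (5)

# Example with two documents and K = 3 topics (toy values)
V = [[0.3, 0.5, 0.2], [0.6, 0.1, 0.3]]
P = page_vector(V, reactions=[120, 40], comments=[15, 5])
```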

4 Experiments and Results

4.1 Data

The data for this project was crawled from the fan pages with the largest fanbases in the Media category of Vietnamese Facebook (according to the ranking of socialbakers.com in October 2019 [11]). Details of the dataset are as follows:

  • Number of fan pages: 100

  • Number of posts (documents): 27,226

  • Number of unique words (segmented by the pyvi toolkit [12]; see the sketch after this list): 56,135

  • Total number of words: 743,725

  • Timespan: October 2019
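For reference, a minimal sketch of the word segmentation step with pyvi [12]; the sample sentence is ours, and compounds in the output are joined with underscores as in Table 1:

```python
# Vietnamese word segmentation with pyvi; the sample text is hypothetical.
from pyvi import ViTokenizer

text = "ô nhiễm không khí ở Hà Nội"  # "air pollution in Hanoi"
print(ViTokenizer.tokenize(text))    # e.g. "ô_nhiễm không_khí ở Hà_Nội"
```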

4.2 Experimental Settings

All experiments were conducted using the scikit-learn toolkit [13]. We used the LDA model for the document topic modeling process with the following parameters:

  • Number of topics: K = 20

  • Parameters of Dirichlet distributions: \( \alpha = \beta = \frac{1}{K} = 0.05 \)

If the number of topics is too small, there will be little diversity among the topic distributions of the corpus. Conversely, if the number of topics is too large, the topics become diffuse and difficult to interpret. Therefore, we set the number of topics K to 20 in the experiments.
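A sketch of how this configuration could look in scikit-learn [13]; the input documents are placeholders, with pre-processing as in Sect. 3.2 assumed to have been done:

```python
# Fitting LDA with K = 20 and alpha = beta = 1/K in scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

documents = ["...pre-processed post 1...", "...pre-processed post 2..."]  # placeholders

X = CountVectorizer().fit_transform(documents)  # bag-of-words counts

K = 20
lda = LatentDirichletAllocation(
    n_components=K,
    doc_topic_prior=1.0 / K,   # alpha
    topic_word_prior=1.0 / K,  # beta
    random_state=0,
)
doc_topic_vectors = lda.fit_transform(X)  # one topic distribution vector per document
```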

4.3 Topic Modeling Results

Table 1 presents the topic modeling results of the LDA method by showing the top 10 keywords of the 20 topics. We can observe that several topics capture quite well the hot events and issues of October 2019. For example, Topic 8 is clearly about the football match between the national teams of Vietnam and Malaysia in the 2022 FIFA World Cup qualification round, with keywords such as “việt_nam” (Vietnam), “malaysia”, and “trận” (match); Topic 5 can be associated with the escalating protests in Hong Kong, with keywords such as “hồng_kông” (Hong Kong), “biểu_tình” (demonstrate), and “dân_chủ” (democracy); and Topic 3 can be identified with the air pollution spike in Hanoi, given keywords such as “không_khí” (air), “ô_nhiễm” (pollution), and “hà_nội” (Hanoi). Other topics about daily issues can also be easily identified from their keywords, such as Topic 0 (fashion and music), Topic 6 (technology and cellphones), and Topic 14 (food and restaurants), to name just a few.

Table 1. Top 10 keywords of 20 topics found by LDA (English translation in parentheses).

4.4 Fan Page Modeling Results

Based on the outputs of LDA, we can infer the topic distribution vector of a document. Since a fan page can publish multiple documents with different topics, we represent the page based on the topic distribution vectors of its documents. The page's topic distribution vector is defined as the weighted sum of the document vectors, as described in Sect. 3.4. It thus has the same dimension as the document vectors, i.e., 20 (since K = 20).

As an example, the resulting topic distribution vector of the fan page “Báo Đời Sống Pháp Luật” (Law and Life Journal) is displayed in Fig. 4. As can be observed, the topic probability distribution attains notable peaks at three topics: Topic 2, a justice-related topic with keywords such as “cảnh sát” (police), “vụ” (case), and “điều tra” (investigate); Topic 10, a family-related topic with keywords such as “mẹ” (mother), “vợ” (wife), and “tiền” (money); and Topic 13, a transportation-related topic with keywords such as “xe” (vehicle), “đường” (street), and “giao thông” (traffic). This result is quite reasonable because justice, family, and transportation are the main concerns of this journal.

Fig. 4. Topic distribution of the fan page “Báo Đời Sống Pháp Luật” (Law and Life Journal).

4.5 Fan Page Clustering Results

With the resulting vector representations of fan pages, we can group them into clusters such that the pages in each cluster have similar topic distributions and the resulting clusters are well separated from each other. We clustered the topic distribution vectors of all fan pages in the dataset with the K-means clustering algorithm. With the optimal number of clusters of 12 (see the results in Table 2), we obtained several example results as follows.

Table 2. Silhouette scores comparison between two methods of fan pages modeling.

Cluster 1 includes several fan pages such as “Giải trí TV” (Entertainment TV), “HTV3 - DreamsTV”, “Kênh Nhạc Việt” (Vietnamese Music Channel), “VTV Giải trí VTV6” (VTV Entertainment VTV6). All of these are the pages of entertainment channels (Fig. 5).

Fig. 5. Topic distribution vectors of the fan pages belonging to Cluster 1.

Cluster 2 includes fan pages of broadcasters about news and politics such as “BBC Tiếng Việt” (BBC Vietnamese), “Đài Châu Á Tự Do” (Radio Free Asia), “RFI Tiếng Việt” (RFI Vietnamese), “VOA Tiếng Việt” (VOA Vietnamese) (Fig. 6).

Fig. 6. Topic distribution vectors of the fan pages belonging to Cluster 2.

As can be seen in Figs. 5 and 6, fan pages with similar topic distributions were grouped quite well thanks to their vector representations.

To quantitatively evaluate the clustering performance, we used the Silhouette score [14] to measure how well the clusters are separated from each other; the higher the score, the better the clustering. We compared the clustering performance of two vector representations of fan pages: our proposed method (LDA-based topic distributions combined with interaction indices, i.e., each document has a different weight in Eq. (3)) and the conventional one (LDA-based topic distributions only, i.e., all documents have the same weight in Eq. (3)). The results in Table 2 show that when the number of clusters is high enough (more than 8), our proposed method outperforms the conventional one on the Silhouette score. In particular, both fan page representation methods achieve their optimal clustering performance when the number of clusters is set to 12. In that case, our proposed method improves the Silhouette score by 9.0% compared to the conventional one (0.3008 vs. 0.2759).
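As a sketch of this evaluation, assuming page_vectors holds the 100 × 20 array of fan page representations (the file name below is hypothetical), the comparison loop might look like:

```python
# K-means clustering of fan page vectors, scored with the Silhouette coefficient.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

page_vectors = np.load("page_vectors.npy")  # hypothetical saved (100, 20) array

for n_clusters in range(2, 16):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(page_vectors)
    print(n_clusters, round(silhouette_score(page_vectors, labels), 4))
```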

5 Conclusion

In this paper, we have proposed a method to represent a fan page by a vector, using LDA-based topic modeling over all fan pages in the corpus combined with an interaction index analysis of their posts. Experimental results showed that this representation can cluster a set of fan pages effectively, achieving better clustering performance than the conventional representation based on LDA alone. The proposed vector representation of fan pages also proved effective in figuring out hot topics as well as regular issues posted on Facebook in a fixed period of time. The main benefit of our approach to fan page modeling and mining is that it lets us follow trending content on this social platform at large scale without collecting the data of regular individual users. In the future, we will apply other models that focus more on the segmentation of documents, such as lda2vec [15], to find out how positively or negatively different fan pages talk about the same topic. We also want to extend the proposed method to include the time factor, reflecting how the relationships between fan pages change over time.