STC: A Joint Sentiment-Topic Model for Community Identification

Yang, Baoguo; Manandhar, Suresh

doi:10.1007/978-3-319-13186-3_48

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8643)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Traditional methods for identifying communities in networks rely on direct link structure and ignore the content shared among groups of entities. More recent community detection approaches use both links and content. However, existing community discovery techniques cannot identify communities that have different sentiment distributions over their topics. To detect sentiment-topic level communities directly and to better explore the hidden knowledge within them, we propose a novel community model that integrates social links, content (topics), and sentiment information. Experimental results on two types of real-world datasets show that our model not only achieves performance comparable to a state-of-the-art community model, but also identifies communities with different topic-sentiment distributions.

1 Introduction

The rapid growth of social media such as Facebook, Myspace, and Twitter gives us more opportunities to contact other people and to share our interests and opinions online. Email is another communication tool that makes sending and receiving messages convenient. A huge amount of data is generated online every day. Discovering previously unknown knowledge and relationships among people is useful for both individuals and organizations.

Example 1 (Email Networks): Email is widely used in daily life, especially in companies and universities. Email correspondence produces abundant messages associated with social relations. For teachers, email recipients can be students, colleagues, friends, family members, librarians, book publishers, and so on. To get a high-level overview of the emails in a mailbox, it is useful to discover the owner's social communities automatically. In each community, we are interested in the topics discussed, the people contacted, and the sentiment expressed on particular topics. Such information is latent and unobservable.

Example 2 (Hotel Tweets): Twitter, a popular microblogging platform, is used not only by individuals but also by many organizations, such as companies, hotels, and online supermarkets. Many hotels have their own Twitter accounts. Customers can send tweets containing opinions and reviews to a hotel, and can comment on other tweets about the hotel's environment, food, and service. To make full use of this data, it is useful to identify the communities associated with the hotel's Twitter account automatically. Communities with clearly negative polarity should be considered first. Hotel managers can act on the main issues these customers raise and then respond to these groups about the improvements made, both to win more customers and to keep negative information from spreading across communities. Note that if we only extract collections of tweets that share the same sentiment topics using traditional sentiment analysis, rather than mining communities, the important social links are ignored.

These examples call for an effective community discovery approach. Research on communities has a long history and has received wide attention in the past decade. In [2, 9], Girvan and Newman propose a popular divisive community detection algorithm based on the concept of betweenness. Tyler et al. [15] propose a modified algorithm that improves the speed of the algorithm in [2]. Overlapping community detection methods have also been proposed, such as [4, 17]. In addition, dynamic community discovery, in which communities are not static but evolve over time, has been studied in recent years [3, 10].

However, most existing community identification methods learn community structure from links alone and ignore the content information in social networks. Community detection has attracted increasing attention in recent years, and methods that discover communities by combining links and content have been proposed in the literature [12, 14, 18–20]. These methods, however, do not consider the valuable sentiment information in social networks.

In this paper, we propose a novel Sentiment-Topic model for Community discovery, called STC, which uses social links, topics, and sentiment in a unified way, with sentiment studied in relation to its corresponding topic. The main goal of this approach is to discover sentiment-level communities, i.e., communities that contain dominant sentiments on certain topics, even though not all communities have dominant sentiment topics. In our model, we define a community as a collection of people who are directly or indirectly connected and who share some sentiment topics with some members of this collection. Note that not every topic is discussed by every member of the community, and not all members have the same sentiment towards a given topic; the connectivity among members is also an important factor. In many cases, even if two groups of people have similar sentiment-topic distributions, they are not placed in the same community when the two groups follow different user distributions.

The rest of this paper is organized as follows. Sect. 2 introduces related work. Sect. 3 presents our community discovery model, its generative process, and parameter estimation. Sect. 4 presents and discusses experimental results on two real-world datasets, including a comparison with a recent model. Sect. 5 gives a short discussion, and Sect. 6 concludes and outlines future work.

2 Related Work

Traditional algorithms focus on identifying disjoint communities [2, 9], whereas in many real-world networks communities overlap to some degree, so that an entity can belong to multiple communities. The clique percolation method proposed by Palla et al. [11] is an early technique for overlapping community detection. Many later algorithms improve detection performance, such as OSLOM [4] and SLPA [17].

The community identification methods mentioned above ignore the content of social interactions in social networks. An early framework for community discovery using both links and content is proposed in [19], where the authors present two community-user-topic (CUT) models based on joint user and topic distributions. In [18], Yang et al. integrate a popularity-based conditional link model and a discriminative content model into a unified framework for discovering communities, and propose a novel two-stage optimization algorithm for maximum likelihood inference.

CART (Community-Author-Recipient-Topic) [12], a Bayesian generative model that extends the Author-Recipient-Topic (ART) model [7], integrates link and content information in a social network to discover communities. It assumes that authors and recipients are generated from a latent group. Another method for detecting communities from links and content is proposed in [14]. In that method, the discussed topics, social links, and interaction types are all used to build several generative community models, namely TUCM (Topic User Community Model), TURCM-1 and TURCM-2 (Topic User Recipient Community Models), and the full TURCM model. More recently, Zhou et al. [20] propose a community profiling model, Collaborator Community Profiling (COCOMP), to identify the communities of each user together with their relevant topics and groups. COCOMP also considers both the social links and the topics between users. In [8, 13], content and links are likewise learnt together to identify communities.

However, the above methods do not consider the sentiment associated with topics, which is an important factor in discovering more meaningful communities at the sentiment level. The joint sentiment/topic model (JST) [6], an extension of the Latent Dirichlet Allocation (LDA) model [1], detects document-level sentiment and topics from documents. In [5], Li et al. introduce two probabilistic joint topic and sentiment models, Sentiment-LDA and Dependency-Sentiment-LDA, in both of which sentiments are related to topics. However, JST, Sentiment-LDA, and Dependency-Sentiment-LDA are not designed for community discovery.

To overcome these problems and identify more meaningful communities, we propose our community model, STC, which uses topics, sentiment, and user interactions in a unified way and takes topic-specific sentiment into consideration.

3 Our Community Discovery Model

The graphical representation of our proposed community model, STC, is shown in Fig. 1. The model has two kinds of variables, latent and observable:

The latent (hidden) variables: Community assignment $c$ ($c=1,2,\cdots ,M$); Topic assignment $z$ ($z=1,2,\cdots ,K$); Sentiment label assignment $l$ ($l=1,2,\cdots ,S$).
The observable variables: Word $w$ (the word in the document); Person $u$ (the person who is sharing the document).

3.1 Generative Process

Suppose there are $K$ latent topics and $S$ sentiment polarities. For each topic $k$ and each sentiment $s$, we have ${{\phi }_{k,s}}|\beta \sim Dir(\beta ),$ where ${{\phi }_{k,s}}$ is the distribution over words for topic $k$ and sentiment $s$.

Let $M$ be the number of communities. Each community is associated with three key parameters: (1) the user participant mixture $\lambda $; (2) the topic mixture $\theta $; (3) the sentiment mixture $\pi $. Specifically, for each community $m$ ($m=1,2,...,M$), ${{\theta }_{m}}$ is the topic mixture (proportion) of community $m$, which follows a Dirichlet distribution $Dir(\alpha )$; ${{\lambda }_{m}}$ is the user participant mixture of community $m$, which follows a Dirichlet distribution with hyperparameter $\delta $; and ${{\pi }_{m,k}}$ is the sentiment mixture for topic $k$ in community $m$, which follows a Dirichlet distribution with hyperparameter $\gamma $. Note that sentiments are studied with respect to topics, because it is not reasonable to study sentiments without considering the corresponding topics. For example, given two topics “laptop” and “weather”, the sentiment words “nice” and “bad” can describe both topics, so the word “nice” alone does not reveal which topic is being discussed.

$$\begin{aligned} {{\theta }_{m}}|\alpha \sim Dir(\alpha ), \quad {{\lambda }_{m}}|\delta \sim Dir(\delta ),\quad {{\pi }_{m,k}}|\gamma \sim Dir(\gamma ). \end{aligned}$$

We define a community proportion $\psi $ based on the whole corpus, $\psi |\mu \sim Dir(\mu )$. In this model, $\alpha $, $\beta $, $\delta $, $\gamma $, $\mu $ are the hyperparameters of Dirichlet distributions.

The generative process for each document $d$ ($d=1,2,...,D$) is then as follows. Choose a community assignment ${{c}_{d}}$ for document $d{:}\,{{c}_{d}}|\psi \sim Mult(\psi ).$

Assume there are ${{U}_{d}}$ people sharing a document $d$. For each person ${{u}_{d,p}}$ ($p = 1,2,...,{{U}_{d}}$) associated with document $d$, the generative process is: Choose a user ${{u}_{d,p}}$ from the participant mixture of community ${{c}_{d}}{:}\,{{u}_{d,p}}|\lambda ,{{c}_{d}}\sim Mult({{\lambda }_{{{c}_{d}}}}).$

Suppose there are ${{N}_{d}}$ word tokens in document $d$. For each word token ${{w}_{d,n}}$ ($n=1,2,...,{{N}_{d}}$) in document $d$, the generative process is:

(1) Choose a topic assignment ${{z}_{d,n}}$ from the topic mixture of community ${{c}_{d}}$:
$$\begin{aligned} {{z}_{d,n}}|\theta ,{{c}_{d}}\sim Mult({{\theta }_{{{c}_{d}}}}). \end{aligned}$$
(2) Choose a sentiment label ${{l}_{d,n}}$ from the sentiment mixture of community ${{c}_{d}}$ for topic ${{z}_{d,n}}$:
$$\begin{aligned} {{l}_{d,n}}|{{c}_{d}},{{z}_{d,n}},\pi \sim Mult({{\pi }_{{{c}_{d}},{{z}_{d,n}}}}). \end{aligned}$$
(3) Choose a word ${{w}_{d,n}}$ from the distribution over words ${{\phi }_{{{z}_{d,n}},{{l}_{d,n}}}}$ defined by the topic ${{z}_{d,n}}$ and sentiment label ${{l}_{d,n}}$:
$$\begin{aligned} {{w}_{d,n}}|{{z}_{d,n}},{{l}_{d,n}},\phi \sim Mult({{\phi }_{{{z}_{d,n}},{{l}_{d,n}}}}). \end{aligned}$$
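The generative process above can be sketched as a small sampler. This is only an illustration under stated assumptions: symmetric scalar hyperparameters, a fixed number of users and word tokens per document, and hypothetical names (`generate_corpus`, `n_users`, `n_words`) that do not come from the paper.

```python
import numpy as np

def generate_corpus(D, M, K, S, V, P, n_words=20, n_users=2, seed=0,
                    alpha=0.1, beta=0.1, delta=0.1, gamma=0.1, mu=0.1):
    """Draw a synthetic corpus following the STC generative story (sketch)."""
    rng = np.random.default_rng(seed)
    # Corpus-level and community-level parameters, all Dirichlet draws.
    psi = rng.dirichlet(np.full(M, mu))                  # community proportion, shape (M,)
    lam = rng.dirichlet(np.full(P, delta), size=M)       # user mixtures, shape (M, P)
    theta = rng.dirichlet(np.full(K, alpha), size=M)     # topic mixtures, shape (M, K)
    pi = rng.dirichlet(np.full(S, gamma), size=(M, K))   # sentiment mixtures, shape (M, K, S)
    phi = rng.dirichlet(np.full(V, beta), size=(K, S))   # word distributions, shape (K, S, V)

    docs = []
    for _ in range(D):
        c = int(rng.choice(M, p=psi))                    # one community per document
        # Users are drawn independently, so repeats are possible in this sketch.
        users = rng.choice(P, size=n_users, p=lam[c])
        z = rng.choice(K, size=n_words, p=theta[c])      # topic of each token
        l = np.array([rng.choice(S, p=pi[c, k]) for k in z])               # sentiment given topic
        w = np.array([rng.choice(V, p=phi[k, s]) for k, s in zip(z, l)])   # word given (topic, sentiment)
        docs.append({"c": c, "users": users, "z": z, "l": l, "w": w})
    return docs
```

Sampling a few documents this way is a convenient sanity check for an inference implementation, since the true assignments are known.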

From the graphical representation shown in Fig. 1, the joint probability for the proposed model can be written as Eq. 1.

$$\begin{aligned} &P(\mathbf{u},\mathbf{c},\mathbf{z},\mathbf{l},\mathbf{w},\lambda ,\psi ,\theta ,\pi ,\phi \mid \delta ,\mu ,\alpha ,\gamma ,\beta ) \\ &\quad = P(\mathbf{u}\mid \mathbf{c},\lambda )\,P(\mathbf{c}\mid \psi )\,P(\mathbf{z}\mid \mathbf{c},\theta )\,P(\mathbf{l}\mid \mathbf{c},\mathbf{z},\pi )\,P(\mathbf{w}\mid \mathbf{z},\mathbf{l},\phi ) \\ &\qquad \times P(\lambda \mid \delta )\,P(\psi \mid \mu )\,P(\theta \mid \alpha )\,P(\pi \mid \gamma )\,P(\phi \mid \beta ). \end{aligned} \tag{1}$$

3.2 Model Inference and Parameter Estimation

In this model, a document belongs to a single community rather than to multiple communities. Each document is shared by at least two people (i.e., an author and at least one recipient), which ensures that at least one link is associated with each document. Once the sender (or author) of a document is known, the user links associated with that document can be identified. The statistics and variables used for inference are described in Table 1.

Table 1. List of statistics and variables.

Let $t=(d,n)$. The conditional posterior probabilities of ${{c}_{d}}$, ${{z}_{t}}$, and ${{l}_{t}}$ can be written as follows.

$$\begin{aligned} &P(c_d=m \mid \mathbf{c}_{-d},\mathbf{u},\mathbf{z},\mathbf{l},\mathbf{w}) \\ &\propto \frac{D_m^{-d}+\mu_m}{\sum_{j=1}^{M}\mu_j + D - 1} \times \frac{\prod_{k\in \mathbf{z}_d}\prod_{i=0}^{f_{d,k}-1}\left(\alpha_k+n_{m,k}^{-d}+i\right)}{\prod_{i=0}^{f_d-1}\left(\sum_{k=1}^{K}\left(\alpha_k+n_{m,k}^{-d}\right)+i\right)} \\ &\quad \times \prod_{k\in \mathbf{z}_d}\frac{\prod_{s\in \mathbf{l}_{d_{(k)}}}\prod_{i=0}^{f_{d,k,s}-1}\left(\gamma_s+n_{m,k,s}^{-d}+i\right)}{\prod_{i=0}^{f_{d,k}-1}\left(\sum_{s=1}^{S}\left(\gamma_s+n_{m,k,s}^{-d}\right)+i\right)} \times \frac{\prod_{p\in \mathbf{u}_d}\left(\delta_p+g_{m,p}^{-d}\right)}{\prod_{i=0}^{e_d-1}\left(\sum_{p=1}^{P}\left(\delta_p+g_{m,p}^{-d}\right)+i\right)}. \end{aligned} \tag{2}$$
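Each product of the form $\prod_{i=0}^{f-1}(a+i)$ in Eq. 2 is a rising factorial, equal to $\Gamma(a+f)/\Gamma(a)$, so the community weights are more safely computed in log space. The sketch below assumes symmetric scalar hyperparameters, reads every denominator as a sum over the full (prior + count) vector, and uses hypothetical array names (`D_m`, `n_mk`, `n_mks`, `g_mp`, `f_dks`, `doc_users`) standing in for the statistics of Table 1; all counts are assumed to exclude document $d$ already.

```python
import math
import numpy as np

_lgamma = np.vectorize(math.lgamma)

def log_rising(a, f):
    """log of prod_{i=0}^{f-1} (a + i), computed as lgamma(a + f) - lgamma(a)."""
    return _lgamma(a + f) - _lgamma(a)

def community_log_weights(D_m, n_mk, n_mks, g_mp, f_dks, doc_users,
                          alpha, gamma, delta, mu):
    """Unnormalised log P(c_d = m | ...) for every community m.

    D_m: (M,) documents per community; n_mk: (M, K); n_mks: (M, K, S);
    g_mp: (M, P) user counts; f_dks: (K, S) token counts of document d;
    doc_users: indices of the distinct users attached to document d.
    """
    f_dk = f_dks.sum(axis=1)               # tokens of d per topic
    f_d = f_dk.sum()                       # tokens of d
    K, S = f_dks.shape
    P = g_mp.shape[1]

    # Community factor; its denominator does not depend on m and is dropped.
    logw = np.log(D_m + mu)
    # Topic factor (topics absent from d contribute log 1 = 0).
    logw += log_rising(n_mk + alpha, f_dk).sum(axis=1)
    logw -= log_rising(n_mk.sum(axis=1) + K * alpha, f_d)
    # Sentiment factor, one ratio per topic.
    logw += log_rising(n_mks + gamma, f_dks).sum(axis=(1, 2))
    logw -= log_rising(n_mks.sum(axis=2) + S * gamma, f_dk).sum(axis=1)
    # User factor.
    logw += np.log(g_mp[:, doc_users] + delta).sum(axis=1)
    logw -= log_rising(g_mp.sum(axis=1) + P * delta, len(doc_users))
    return logw

def sample_community(logw, rng):
    """Draw a community index from unnormalised log weights."""
    p = np.exp(logw - logw.max())          # subtract the max for numerical stability
    return int(rng.choice(len(p), p=p / p.sum()))
```

Working with log-gamma differences avoids the overflow that the raw products would cause for long documents.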

Given the community assignment ${{c}_{d}}$ of document $d$, the joint posterior distribution of ${{z}_{t}}$ and ${{l}_{t}}$ can be derived as follows.

$$\begin{aligned} &P(z_t=k, l_t=s \mid \mathbf{w},\mathbf{z}_{-t},\mathbf{l}_{-t},c_d) \\ &\propto \frac{n_{c_d,k}^{-t}+\alpha_k}{\sum_{k'=1}^{K}\left(n_{c_d,k'}^{-t}+\alpha_{k'}\right)} \times \frac{n_{c_d,k,s}^{-t}+\gamma_s}{\sum_{s'=1}^{S}\left(n_{c_d,k,s'}^{-t}+\gamma_{s'}\right)} \times \frac{n_{k,s,v}^{-t}+\beta_v}{\sum_{v'=1}^{V}\left(n_{k,s,v'}^{-t}+\beta_{v'}\right)}. \end{aligned} \tag{3}$$
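One way to implement the update in Eq. 3 is to build the full $K\times S$ table of probabilities for a token and draw one cell from it. The following is a minimal sketch assuming symmetric scalar hyperparameters and count arrays that already exclude the token being resampled; the function and argument names are hypothetical.

```python
import numpy as np

def topic_sentiment_posterior(c, v, n_mk, n_mks, n_ksv, alpha, gamma, beta):
    """Normalised (K, S) table of P(z_t = k, l_t = s | ...) for word v in community c.

    n_mk: (M, K) tokens per community and topic; n_mks: (M, K, S);
    n_ksv: (K, S, V) word counts per topic and sentiment.
    """
    K, S, V = n_ksv.shape
    p_topic = (n_mk[c] + alpha) / (n_mk[c].sum() + K * alpha)                        # (K,)
    p_sent = (n_mks[c] + gamma) / (n_mks[c].sum(axis=1, keepdims=True) + S * gamma)  # (K, S)
    p_word = (n_ksv[:, :, v] + beta) / (n_ksv.sum(axis=2) + V * beta)                # (K, S)
    p = p_topic[:, None] * p_sent * p_word
    return p / p.sum()

def sample_topic_sentiment(c, v, n_mk, n_mks, n_ksv, alpha, gamma, beta, rng):
    """Draw a (topic, sentiment) pair jointly from the table above."""
    p = topic_sentiment_posterior(c, v, n_mk, n_mks, n_ksv, alpha, gamma, beta)
    S = p.shape[1]
    flat = int(rng.choice(p.size, p=p.ravel()))   # index into the flattened table
    return divmod(flat, S)                        # (k, s)
```

In a full sampler, the counts for the token would be decremented before this call and incremented for the new pair afterwards.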

The parameters are then estimated from the counts as follows:

$$\begin{aligned} \psi_m=\frac{D_m+\mu_m}{\sum_{j=1}^{M}\mu_j+D}, \quad \lambda_{m,p}=\frac{g_{m,p}+\delta_p}{\sum_{p'=1}^{P}\left(g_{m,p'}+\delta_{p'}\right)}, \quad \theta_{m,k}=\frac{n_{m,k}+\alpha_k}{\sum_{k'=1}^{K}\left(n_{m,k'}+\alpha_{k'}\right)}, \end{aligned}$$

$$\begin{aligned} \pi_{m,k,s}=\frac{n_{m,k,s}+\gamma_s}{\sum_{s'=1}^{S}\left(n_{m,k,s'}+\gamma_{s'}\right)}, \quad \phi_{k,s,v}=\frac{n_{k,s,v}+\beta_v}{\sum_{v'=1}^{V}\left(n_{k,s,v'}+\beta_{v'}\right)}. \end{aligned}$$
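These estimates are smoothed, normalised counts, so they reduce to a few array operations. The sketch below assumes symmetric scalar hyperparameters and hypothetical count arrays collected after the final Gibbs sweep.

```python
import numpy as np

def estimate_parameters(D_m, g_mp, n_mk, n_mks, n_ksv, mu, delta, alpha, gamma, beta):
    """Point estimates of psi, lambda, theta, pi and phi from final counts.

    D_m: (M,), g_mp: (M, P), n_mk: (M, K), n_mks: (M, K, S), n_ksv: (K, S, V).
    """
    def normalise(x, axis):
        return x / x.sum(axis=axis, keepdims=True)

    psi = normalise(D_m + mu, axis=0)        # sum_m D_m equals D, the number of documents
    lam = normalise(g_mp + delta, axis=1)
    theta = normalise(n_mk + alpha, axis=1)
    pi = normalise(n_mks + gamma, axis=2)
    phi = normalise(n_ksv + beta, axis=2)
    return psi, lam, theta, pi, phi
```

Every returned array sums to one along its last axis, which is a quick check that the counts were accumulated with the intended shapes.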

4 Experiment and Result Analysis

4.1 Experiment Setup

The experiments use two types of datasets: an email dataset and a Twitter microblog dataset. From the Enron dataset (Note 1), we randomly select five user folders. One of them, ‘arnold-j’, is used for experiments from an individual user’s perspective (denoted arnold-j), and the other four folders, namely ermis-f, shively-h, whalley-g, and zipper-a, are used together as a single dataset (denoted EnronFourUsrs). We apply a series of preprocessing steps to arnold-j and EnronFourUsrs (Note 2), including removal of duplicated emails and basic text preprocessing (stopword removal, stemming, etc.). The second type of dataset is a Twitter corpus (Note 3), which includes 5513 tweets covering four main topics: Apple, Google, Microsoft, and Twitter. We keep the tweets labelled with one of three sentiments (positive, negative, or neutral) and remove empty tweets and tweets without recipients. Screen names extracted from the tweet text are treated as recipients, and the tweets are preprocessed so that the final document format is the same as for the Enron datasets. Each of the four main topics in the original Twitter dataset can in fact be divided into several subtopics. The final preprocessed datasets used in our experiments are summarized in Table 2.
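The paper does not say how screen names are extracted from tweet text. A plausible sketch, assuming that a mention is an "@" followed by 1 to 15 letters, digits, or underscores and that names are compared case-insensitively, is shown below; `extract_recipients` is a hypothetical helper.

```python
import re

# Assumption: a screen name is 1-15 word characters after "@", and the "@"
# is not preceded by a word character (which would indicate an email address).
MENTION = re.compile(r"(?<![A-Za-z0-9_])@([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])")

def extract_recipients(tweet):
    """Return the distinct screen names mentioned in a tweet, lowercased, in order."""
    names = (name.lower() for name in MENTION.findall(tweet))
    return list(dict.fromkeys(names))   # de-duplicate while preserving order
```

Tweets for which this returns an empty list would be the ones dropped as having no recipients.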

Table 2. Basic information for the final datasets in the experiments.

Following [5, 6], we use a subjectivity lexicon as prior information for model learning. Specifically, we use MPQA (Note 4) [16] as the sentiment prior knowledge.

In our model, the initial values of the symmetric hyperparameters are set to $\alpha =50/K$ and $\beta = \delta = \gamma = \mu = 0.1$. Collapsed Gibbs sampling is run for 500 iterations to estimate the model parameters. Each dataset is divided into two parts: 80 % is used for model training, and the rest is held out as a test set.
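The stated settings can be collected in a small configuration helper. This sketch assumes the 80/20 split is a random shuffle of documents, which the paper does not specify; the function names and the `seed` parameter are hypothetical.

```python
import random

def make_hyperparameters(K):
    """Symmetric hyperparameters and iteration count as stated in the setup."""
    return {"alpha": 50.0 / K, "beta": 0.1, "delta": 0.1,
            "gamma": 0.1, "mu": 0.1, "iterations": 500}

def train_test_split(docs, train_fraction=0.8, seed=0):
    """Shuffle documents and split them into training and held-out parts."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)      # assumption: the split is random
    cut = int(round(train_fraction * len(docs)))
    return docs[:cut], docs[cut:]
```

For example, `make_hyperparameters(10)` gives `alpha = 5.0`, matching $\alpha = 50/K$ for the $K=10$ setting reported in the tables.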

4.2 Analysis for Distributions Within Communities

In our model, each community has multiple topics, and each topic has multiple sentiment polarities. We study the distributions within communities on the different datasets.

Figure 2 shows the distribution of topics in individual communities. Fig. 2(a) shows that topics are distributed almost evenly within community 9 on the Enron dataset. For selected communities on the Twitter dataset, shown in Fig. 2(b) and 2(c), some topics are clearly dominant. In Fig. 2(b), topic 3 (google android) is the dominant topic in community 1. In community 13, topic 6 (apple use) and topic 8 (iphone service), both subtopics of “apple”, have large proportions. These distributions imply that in some communities people are interested mainly in a small number of topics, which is in accordance with our main goal and community definition.

Table 3. Arnold-j’s biggest community (community 4), $M=5$, $K=10$.

Apart from the topic distribution within selected individual communities, we also investigate the topic distributions of all communities and the sentiment distributions of all topics within an individual community. Figure 3(a) and 3(b) show the topic and sentiment distributions on the Twitter dataset, respectively. Fig. 3(a) shows that most communities have distinct topic distributions, although a few are somewhat similar. Fig. 3(b), which gives the sentiment distribution for the topics in community 0, shows that sentiment can differ from topic to topic. This matches real life, where two communities may hold different sentiments towards certain topics even if they have similar topic distributions (i.e., they talk about a similar range of topics).

4.3 Community Analysis on Individual Users

We also study the communities of a single user, arnold-j (John Arnold, a vice president at Enron). Table 3 lists arnold-j’s largest community (community 4). Columns 1 and 2 show the main relevant topics and their probabilities within this community, columns 3–5 list the sentiment proportions for the corresponding topics, and the final column gives the three most active persons, i.e., those with the highest likelihoods in this community. Table 3 shows that the dominant sentiment polarity can vary with the topic, and that John Arnold is the core person in this community.

In the Twitter dataset, we choose the entity with the screen name ‘@Apple’ to study the hidden knowledge in its communities. Table 4 shows selected communities and sentiment topics related to @Apple. Column 1 gives three selected communities in which @Apple participates, columns 2 and 3 list the two most discussed topics for each community with their proportions, and the last three columns give the sentiment proportions for the corresponding topics. Table 4 shows that the mainly discussed topics differ among communities, which indicates that communities 9, 10, and 5 are well identified and supports the effectiveness and feasibility of our model.

Table 4. Selected communities of the user @Apple (ScreenName), $M=20$, $K=10$.

Based on the topics listed in Table 4, Table 5 shows the top five words for each sentiment polarity of topic 1 and topic 6; each column lists highly ranked sentiment words and topic words. From these words, we can observe that topic 1 is about Twitter and topic 6 is about Apple. STC is a first attempt to detect sentiment-topic level communities; the existing COCOMP model cannot detect this sentiment information.

Table 5. Top ranked words for selected topics with different sentiments extracted by STC model.

4.4 Comparing with COCOMP Model

Ground-truth communities are usually unavailable, which makes evaluation challenging. To evaluate our model, we analyse perplexity and compare with the state-of-the-art COCOMP model [20], a topic-level community discovery model. Each word in our model is determined by two factors, topic and sentiment, whereas in COCOMP it is determined by topic alone. In STC, generating a target word requires both the topic and the sentiment to be assigned correctly, otherwise the perplexity gets worse, while COCOMP requires only a correct topic assignment. The perplexity of our model is computed as in Eq. 4. A lower perplexity indicates better performance.

$$\begin{aligned} Perplexity(D_{test})=\exp \left\{ -\frac{\sum_{m=1}^{M}\log P(\tilde{\mathbf{w}}_m \mid \mathbf{w})}{\sum_{m=1}^{M}n_m}\right\} . \end{aligned} \tag{4}$$

$$\begin{aligned} P(\tilde{\mathbf{w}}_m \mid \mathbf{w}) &= \prod_{n=1}^{n_m}\sum_{k=1}^{K}\sum_{s=1}^{S} P(w_n=t \mid z_n=k, l_n=s)\,P(l_n=s \mid z_n=k, c_{w_n}=m)\,P(z_n=k \mid c_{w_n}=m) \\ &= \prod_{t=1}^{V}\left( \sum_{k=1}^{K}\sum_{s=1}^{S}\phi_{k,s,t}\,\pi_{m,k,s}\,\theta_{m,k}\right)^{n_m^{(t)}}. \end{aligned} \tag{5}$$

$$\begin{aligned} \log P(\tilde{\mathbf{w}}_m \mid \mathbf{w})=\sum_{t=1}^{V}n_m^{(t)}\log \left( \sum_{k=1}^{K}\sum_{s=1}^{S}\phi_{k,s,t}\,\pi_{m,k,s}\,\theta_{m,k}\right) . \end{aligned} \tag{6}$$

In Eq. 4, $D_{test}$ denotes the held-out test documents, ${{{\mathbf {\tilde{w}}}}_{m}}$ denotes the words of test documents that appear in community $m$, $\mathbf {w}$ denotes the words in the training documents, and $n_m$ is the number of words in community $m$. In Eq. 5, ${n_{m}^{(t)}}$ is the number of times term $t$ is observed in community $m$, and ${{c}_{{{w}_{n}}}}$ is the community in which word ${{w}_{n}}$ appears.
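The per-community log-likelihood of Eq. 6 is a weighted sum over the vocabulary, so held-out perplexity can be computed with one tensor contraction. The sketch below assumes the conventional exponentiated form, exp of the negative average log-likelihood per word, and a hypothetical `(M, V)` matrix `test_counts` holding $n_m^{(t)}$ for the test documents assigned to each community.

```python
import numpy as np

def perplexity(test_counts, phi, pi, theta):
    """Held-out perplexity from estimated parameters.

    test_counts: (M, V) term counts per community; phi: (K, S, V);
    pi: (M, K, S); theta: (M, K).
    """
    # word_prob[m, t] = sum_k sum_s phi[k, s, t] * pi[m, k, s] * theta[m, k]
    word_prob = np.einsum("kst,mks,mk->mt", phi, pi, theta)
    mask = test_counts > 0                  # skip unseen terms to avoid 0 * log(0)
    log_lik = (test_counts[mask] * np.log(word_prob[mask])).sum()
    return float(np.exp(-log_lik / test_counts.sum()))
```

As a sanity check, if every word distribution is uniform over $V$ terms, the function returns $V$ regardless of the counts.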

The perplexity results for the two datasets are shown in Figs. 4 and 5. Each figure shows the perplexity of STC and COCOMP for varying numbers of topics and communities. As Fig. 4(a) and 4(b) show, the perplexity of our model is lower than that of COCOMP. In Fig. 5(a) and 5(b), the perplexity of our model is somewhat worse than that of COCOMP but still comparable. Enron email and Twitter are two different types of social networking data, and the former is more formal than the latter; in general, tweets carry more sentiment information than emails. Which model achieves the better perplexity is not our main concern, as long as our model performs close to COCOMP. Our model is proposed to identify sentiment-level communities, which COCOMP and other community discovery methods do not consider.

5 Discussions

We build our community discovery model, STC, using social links, topics, and sentiment information in a unified way. These three factors are all significant for identifying meaningful community structures. However, this does not mean that incorporating more information into the model always gives better results: when the added information is unimportant, the redundant factors make the model more complex and less efficient. Not all communities carry sentiment information; our model is proposed to identify communities that show a certain degree of sentiment polarity.

6 Conclusion and Future Work

Discovering communities in networks has been widely studied in recent years and helps us understand the latent knowledge and distributions within them. In this paper, we propose a novel community discovery model, STC, to explore communities with different topic-sentiment distributions. The model combines content, links, and sentiment words seamlessly and can identify communities at the level of sentiment analysis, whereas most existing methods for community identification do not consider the valuable sentiment factor in networks. Experimental results on two types of real-world datasets show that our model can detect sentiment-level communities with comparable performance, and it may be applicable to opinion analysis and decision making in large business and marketing services.

There are several directions for future work. Topic words and sentiment words are mixed together in our experiments, and it would be interesting to separate them. In addition, discovering communities that have clear sentiment differences on a given topic would also be useful. Another direction is to investigate how communities evolve as users’ sentiment topics change.

Notes

1. http://www-2.cs.cmu.edu/~enron/
2. Note that we will use Enron to represent EnronFourUsrs in the following sections.
3. http://www.sananalytics.com/lab/twitter-sentiment/
4. http://www.cs.pitt.edu/mpqa/

References

1. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
2. Girvan, M., Newman, M.: Community structure in social and biological networks. PNAS 99(12), 7821–7826 (2002)
3. Kim, M., Han, J.: A particle-and-density based evolutionary clustering method for dynamic networks. VLDB Endowment 2(1), 622–633 (2009)
4. Lancichinetti, A., Radicchi, F., Ramasco, J., Fortunato, S.: Finding statistically significant communities in networks. PLoS ONE 6(4), e18961 (2011)
5. Li, F., Huang, M., Zhu, X.: Sentiment analysis with global topics and local dependency. In: AAAI, pp. 1371–1376 (2010)
6. Lin, C., He, Y.: Joint sentiment/topic model for sentiment analysis. In: CIKM, pp. 375–384 (2009)
7. McCallum, A., Wang, X., Corrada-Emmanuel, A.: Topic and role discovery in social networks with experiments on Enron and academic email. J. Artif. Intell. Res. 30(1), 249–272 (2007)
8. Natarajan, N., Sen, P., Chaoji, V.: Community detection in content-sharing social networks. In: ASONAM, pp. 82–89 (2013)
9. Newman, M., Girvan, M.: Finding and evaluating community structure in networks. Phys. Rev. E 69(2), 026113 (2004)
10. Palla, G., Barabasi, A., Vicsek, T.: Quantifying social group evolution. Nature 446(7136), 664–667 (2007)
11. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. Nature 435, 814–818 (2005)
12. Pathak, N., DeLong, C., Banerjee, A., Erickson, K.: Social topic models for community extraction. In: The 2nd SNA-KDD Workshop, vol. 8 (2008)
13. Ruan, Y., Fuhry, D., Parthasarathy, S.: Efficient community detection in large networks using content and links. In: WWW, pp. 1089–1098 (2013)
14. Sachan, M., Contractor, D., Faruquie, T., Subramaniam, L.: Using content and interactions for discovering communities in social networks. In: WWW, pp. 331–340 (2012)
15. Tyler, J., Wilkinson, D., Huberman, B.: Email as spectroscopy: automated discovery of community structure within organizations. In: Communities and Technologies, pp. 81–96 (2003)
16. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing contextual polarity in phrase-level sentiment analysis. In: HLT-EMNLP, pp. 347–354 (2005)
17. Xie, J., Szymanski, B., Liu, X.: SLPA: uncovering overlapping communities in social networks via a speaker-listener interaction dynamic process. In: ICDM Workshops, pp. 344–349 (2011)
18. Yang, T., Jin, R., Chi, Y., Zhu, S.: Combining link and content for community detection: a discriminative approach. In: KDD, pp. 927–936 (2009)
19. Zhou, D., Manavoglu, E., Li, J., Giles, C., Zha, H.: Probabilistic models for discovering e-communities. In: WWW, pp. 173–182 (2006)
20. Zhou, W., Jin, H., Liu, Y.: Community discovery and profiling with social messages. In: KDD, pp. 388–396 (2012)

Author information

Authors and Affiliations

Department of Computer Science, University of York, York, UK
Baoguo Yang & Suresh Manandhar

Corresponding author

Correspondence to Suresh Manandhar.

Editor information

Editors and Affiliations

National Chiao Tung University, Hsinchu, Taiwan
Wen-Chih Peng
Google Research, Mountain View, California, USA
Haixun Wang
University of Melbourne, Melbourne, Victoria, Australia
James Bailey
National Cheng Kung University, Tainan, Taiwan
Vincent S. Tseng
Japan Advanced Institute of Science and Technology, Nomi City, Japan
Tu Bao Ho
Nanjing University, Nanjing, China
Zhi-Hua Zhou
National Chengchi University, Taipei, Taiwan
Arbee L.P. Chen

Yang, B., Manandhar, S. (2014). STC: A Joint Sentiment-Topic Model for Community Identification. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science, vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_48

DOI: https://doi.org/10.1007/978-3-319-13186-3_48
Published: 26 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13185-6
Online ISBN: 978-3-319-13186-3
eBook Packages: Computer Science (R0)
