Keywords

1 Introduction

Community detection is an important topic in mobile social network, which could help us to understand mobile social network structures and provide good location based services. Community detection in mobile social network has a broad range of applications such as Ad pushing, crowd-sourcing, marketing etc. There are two types of community detection algorithms in mobile social network. The first type is graph-based community detection algorithms such as Newman algorithm [11], Louvain algorithm [2] and Label Propagation Algorithm (LPA) [14]. These state-of-the-art graph-based community detection algorithms could find cohesive subgroups in a network based on graph theory, while ignore the characteristics of node in the network. The second type is semantic-based community detection method [1] can explore overlapping semantic communities from the perspective of characteristics analysis. In mobile social network, users’ location interest is very important for location-based services which indicate users’ location preference. Many existing community detection approaches [6, 8, 9] did not take mobile users’ location interest into account, that cannot support personalized location-based services.

We take mobile users’ location interest into account which could support personalized location-based services. In this paper, we propose an location-interest-aware community detection approach for mobile social networks. Specially, we develop a spatial-temporal topic model to describe users’ location interest, and introduce an auto encoder mechanism to represent users’ location features and social network features as low-dimensional vectors, based on which a community detection algorithm is applied to divide users into sub-graphs. We conduct extensive experiments based on a real-world mobile social network dataset, which demonstrate that the proposed community detection approach outperforms the baseline algorithms in a variety of performance metrics.

Our main contributions are summarized as follows.

  • A novel idea of community detection approach based on location interest mining.

  • A spatial-temporal topic model to extract users’ interests.

  • Comprehensive feature representations.

  • An auto encoder based location-interest-aware community detection approach.

  • Experiments based on real-world dataset.

2 Related Work

In this section, we review related work of community detection models. The input of the community detection algorithm is a graph (a set of nodes representing individuals, they are connected by edges), and the output is a list of node groups representing the communities. Usually everyone belongs to a community. The community detection algorithm in social networks can be divided into graph-based methods and semantic-based methods. Graph-based community detection methods, such as Newman algorithm [11], Louvain algorithm [2] and Label Propagation Algorithm (LPA) [14], these graph-based community detection algorithms [11, 12], which only use graph theory, have obvious limitations. For example, they ignore the characteristics of user interest which can lead to the lack of interest cohesiveness in the community detection results. Semantic-based community detection methods [1, 14] can explore overlapping semantic communities from the perspective of characteristics analysis. Community detection can also transform users into vectors, use unsupervised clustering algorithms for community segmentation, and classify similar users into the same cluster. Clustering algorithms could be K-means [7], DBSCAN [5], spectral clustering [13], etc. The more similar or closer the users in the same cluster, the better the result is. Many community detection approaches did not take mobile users’ location interest into account, that cannot support personalized location-based services.

3 Our Work

The solution framework shows in Fig. 1, including the following subsection: data cleaning, extract location interest feature by LDA, extract location distribution features, extract user relation features, user representation by auto encoder, interest community clustering and visualization.

Fig. 1.
figure 1

The solution framework

Fig. 2.
figure 2

Auto encoder architecture

3.1 Data Cleaning

We use the mobile social network dataset from Gowalla [3], a location-based social networking service through which users share their locations by checking-in. The Gowalla dataset contains a total of 6,442,890 check-ins of 196,591 users over the period of Feb. 2009 - Oct. 2010.

In our work, we extract users from San Francisco. Since some of the users rarely check-in, we filter out users with fewer than 10 check-in locations. Because some locations have few associated check-ins, we filter out locations with fewer than 10 check-ins. After filtering out the unqualified users and locations, the dataset has 1,995 users, 3,251 locations, and 106,098 check-ins.

3.2 Extract Location Interest Features by LDA

As we found in [10], check-ins at a location at different time represent different degrees or different types of location interest. In order to extract users’ interests, we consider both time and location factors. We propose a temporal-spacial LDA (Latent Dirichlet Allocation) model to extract interests. Specifically, we treat the combination time and location of a check-in event as a word, and view the historical trajectory of a user as a document.

We apply LDA to all users’ document, and convert their check-in trajectories into interest distribution represented by K topics. The user distribution on each location are shown in the user-location matrix, which is obtained by counting the trajectories. The users’ interest distribution is shown in user-interest matrix, where the number in the i-th row and j-th column represents the interest degree of the i-th user in the j-th topic. The interest distribution at each location is shown in the interest-location matrix. These latter two matrices come from the LDA process which use Gibbs sampling [4] as inference technique. To determine the best number of K for LDA, we use cross validation to determine the likelihood. It is observed that the maximum likelihood achieved for \(K=40\). Therefore 40 topics are selected in our experiments. In our experiment after 30,000 iterations the LDA model converges.

3.3 Extract Location Distribution Features

Here we use the way of dividing the map into \(N*M\) grids, extracting the user’s position distribution to form a position vector. The dimension of each user location vector is \(N*M\). Here we set it to \(N*M=6*6\).

3.4 Extract User Relation Features

The representation of the user relationship can be represented by the adjacency matrix. In this case, if the number of users is \(N_u\), the user relationship can be represented by an adjacency matrix of \(N_u*N_u\). Each user’s friend relationship vector dimension is \(N_u\). Since the dimension of the user relationship is very high here, the dimensionality reduction is performed by auto encoder (Fig. 4).

The algorithm principle of user friend relationship vector dimension reduction is shown in Fig. 2. Auto encoder is neural network that intended to replicate the output of the inputs. It works by compressing the input into a latent-space representation, and then reconstructing the output of this representation.

We set the user’s friend relationship vector by the auto encoder. We set the size of input layer to 1995, and set the size of hidden layer to 40, with the structure of \(1995-128-64-40\). The network convert user vector from 1995 dimensions to 40 dimensions.

Fig. 3.
figure 3

Community visualization, Q = 0.379313

Fig. 4.
figure 4

Community visualization at low dimension, Q = 0.720976

3.5 User Representation

The representation of a user is based on the interest vector, location distribution vector, and user relationship vector derived from last three steps.

3.6 User Representation by Auto Encoder

Here we use auto encoder to convert the user vector to low dimension for clustering. The algorithm principle of the user vector dimension reduction is shown in the Fig. 2.

3.7 Community Clustering and Visualization

After the user vector is encoded by auto encoder, we then cluster the users according to the K-Means algorithm. After clustering, We plot the user vector of 2 dimensions, and we can see the visual location community. Different colors represent different clustering groups, which intuitively visualizes the distance relationship between users, as shown in Fig. 3.

4 Experiment

4.1 Experimental Setting

Dataset. We use the mobile social network dataset from Gowalla [3]. Gowalla is a location-based social networking service where users share their locations by checking-in. The Gowalla dataset contains a total of 6,442,890 check-ins of 196,591 users over the period of Feb. 2009 - Oct. 2010.

Default Parameters. Each user and location have at least 10 check-ins. The number of LDA topics is 40. The vector dimension of the user’s geographic location distribution is \(6*6=36\) dimension. The user friend relationship vector is converted to 40 dimensions by auto encoder. The user vector is convert to 116-dimensional via auto encoder. The K-Means parameter K is 120.

4.2 Baseline Algorithms

We compare the proposed method with the following algorithm.

  • Graph model based algorithm-FastNewman [11], LPA [14].

  • Spectral clustering based on interest. The user’s interest vector is extracted using the topic model algorithm LDA, and then the user interest vector is classified by spectral clustering [13].

  • Geographic location based algorithm. Each user’s most frequent geographic location is used as the classification basis.

4.3 Metric

We use an improved community module metric for performance evaluation. It includes original community modularity, community location similarity, and community interest similarity scores.

The original metric of community modularity [12], which measures the quality of community clustering, is defined as follows:

$$\begin{aligned} Q_q=\frac{1}{2m}\sum _{vw}\left[ A_{vw}-\frac{k_{v}k_{w}}{2m}\right] \delta (c_{v},c_{w}) \end{aligned}$$
(1)

A is the adjacency matrix of the network, \(k_{v}\) is the degree of node v, and m is the number of edges in network A. \(c_{v}\) represents the community to which the node v belongs. If \(i=j, \delta (i,j) =1\), otherwise \(\delta (i,j) =0\). In our approach, we adopt improved community modularity performance metrics taking user’s interests, locations, and social relations into account.

$$\begin{aligned} Q=Q_{q}+Q_{\alpha }+Q_{\beta } \end{aligned}$$
(2)
$$\begin{aligned} Q_{\alpha }=\frac{1}{2m_{e}}\sum _{vw}\alpha (v,w)\delta (c_{v},c_{w}) , \end{aligned}$$
(3)
$$\begin{aligned} Q_{\beta }=\frac{1}{2m_{e}}\sum _{vw}\beta (v,w)\delta (c_{v},c_{w}) \end{aligned}$$
(4)

\(\alpha (v,w)\) means that if different users vw interest are similar, then \(\alpha (v,w)=1\), otherwise \(\alpha (v,w)=0\). Since the user self comparison is not meaningful, set \(\alpha (v,v)=0\) so that when the number of communities equals the number of users, the \(Q_{\alpha }\) value must be zero. \(\beta (v,w)\) means that if different users vw are similar in orientation, \(\beta (v,w)=1\), otherwise \(\beta (v,w)=0\). Similarly, \(\beta (v,v)=0\) is set so that when the number of communities is equal to the number of users, the \(Q_{\beta }\) value must be zero. Whether the different users’ vw interests are similar or not, is determined by whether the user vw is most interested in the same topic. Whether the user’s vw orientation is similar or not, is determined by whether the user’s vw is most often in the same orientation. \(m_{e}\) represents the number of vw pairs that make \(\delta (c_{v},c_{w})\) not equal zero. This makes the value of \(2m_{e}\) increase dramatically when the number of clustering is too small, which can penalize the modularity. Intuitively, two users with similar interests and similar locations will lead to higher probability of being assigned to same community. If two users are assigned to same community, the more similar they are, the higher the module metric. Therefore, considering the social similarity degree of the user’s location, the higher modularity indicates that the community clustering performance is better. \(Q_{q}\) value range \([-1/2,1]\), \(Q_{\alpha }\) value range [0, 1/2), \(Q_{\beta }\) value range [0, 1/2), so Q generally takes the range of \([-1/2,2)\).

Table 1. Algorithm performance comparison

4.4 Experimental Result

We compare the performance of our algorithm with the benchmark algorithms. According to Table 1, the performance of the proposed algorithm is better than the benchmark algorithms. The Fast Newman algorithm achieved the highest \(Q_{q}\) value of 0.466334, but it is far behind our proposed algorithm in terms of community interest score and community geographic location score. The LPA graph model algorithm also achieved high \(Q_{q}\) values, but it is far behind our proposed algorithm in terms of \(Q_{\alpha }\) and \(Q_{\beta }\). Based on the geographic location algorithm, each user’s most frequently visited geographic location is used as a classification basis, it get the highest \(Q_{\beta }\) score of 0.499253. However, the algorithm does not take into account the network modularity and interest similarity, so it does not perform well on other scores. The interest spectral clustering method uses the topic model algorithm LDA to extract the user’s interest vector, and then clusters the users with spectral clustering. So get the highest \(Q_{\alpha }\) score of 0.330194. However, the algorithm does not take into account the location module, so it does not perform well on other scores. The algorithm we proposed takes into account the interest, location, and friend relationship. Due to dimension reduction to reduce complexity, not only the computational efficiency is high, but also the overall performance is best in terms of Q. Overall, our approach is better than traditional benchmark algorithms by combining location interests, location, and friendships.

4.5 Parameter Sensitivity Analysis

Next, we perform parameter sensitivity analysis by comparing system performance under different parameters, including K-means clustering parameter K, fused data type.

K-means Clustering Parameter K. K-means clustering parameter K is an important system parameter of the model. The K-means clustering algorithm depends on the clustering group number K. In the experiment, we increased K from 60 to 480. The result is shown in Table 2. The K-means parameter K has an impact on performance. The performance is best when the K-means parameter \(K=120\). This means that bigger or smaller community number cannot improve clustering performance. Small community number means that users with different interests are assigned to the same community, making the interests of the communities different. Big community number shows that the network module degree \(Q_{q}\) value not perform well.

Table 2. The influence of K
Table 3. Fused types of data

Fused Types of Data. Our model incorporates user interests, location, and social information. The types of data fused in the model is an important system parameter of the model. In the experiment, we increased this parameter from only the location information to all kinds of data participated in the operation. The result is shown in Table 3. The more types of data is fused, the better our approach performs: based on the interest and location, the method is slightly better than only the interest-based method. After adding the social vector, the overall Q performance is the best. Overall, our approach is better than traditional benchmark algorithms by combining location interests, location, and friendships. The types of data has an impact on performance. When the number of data types is too small, it means that the amount of information considered in clustering is few, so that the clustering performance is not high under comprehensive consideration. The model performs best when all types of data is included. This means more data can bring more information and improve performance.

5 Conclusion

In this paper, we propose a location-interest-aware community detection approach for mobile social networks. We develop a spatial-temporal topic model to describe users’ location interest, and introduce an auto encoder mechanism to represent users’ location features and social network features as low-dimensional vectors, based on which a community detection algorithm is applied to divide users into sub-graphs. We conduct extensive experiments based on a real-world mobile social network dataset, which demonstrate that the proposed community detection approach outperforms the baseline algorithms in a variety of performance metrics.