
1 Introduction

Geotagged images are pervasive and provide an intuitive, objective view of daily life. Because they readily reflect personal, regional, and even social characteristics, much research has used social images to support everyday applications. Geographical analysis of social media has been widely investigated in recent years. While most existing studies focus on landmarks, under the assumption that landmarks are representative of regions [1,2,3,4], other perspectives such as local festivals and events can also be essential for profiling a region. We therefore study the problem of forming a comprehensive description of geographical characteristics from social media. Such a description of a specific region helps us better recognize the region and supports applications such as tourist advertising.

While some existing applications such as tourist recommendation and location retrieval could also be extended to this problem [5,6,7,8], they rely mainly on textual information, e.g., social tags. Geotagged photos, in contrast, offer an intuitive understanding of a specific region and can benefit applications in several domains. For example, a system could generate a recommendation based on its understanding of images, which frees users from the effort of finding a proper word to describe a region. Since the goal is to understand a region from images, the challenge lies in how to map low-level visual features to semantic characteristics.

Fig. 1. Motivation of the model. We assume that in every geographic area, people’s life consists of several aspects, e.g., sports and music. These aspects can be represented by several clusters, and each cluster is formed from a large number of images.

In this paper, we propose a Geographical Latent Attribute Model (GLAM) to learn geographical characteristics from photo collections. We assume that each region consists of some latent “attributes” (considered as characteristics) and each “attribute” consists of image “clusters”. The motivation of our model is illustrated in Fig. 1 using Beijing as an example. A city may be described by several aspects (e.g., historical buildings), and each aspect includes different image clusters (e.g., antiques, temples, sculptures). These clusters are summarized from images taken in Beijing. Following the idea of the generative model, we introduce corresponding latent variables to formalize this procedure. By learning the latent parameters, a comprehensive view about geographical regions is formed.

The major contributions of this paper could be summarized as follows:

  • We propose a Geographical Latent Attribute Model (GLAM) to learn geographical characteristics from photo collections without utilizing any textual information.

  • We validate the proposed model with 2.5M Flickr photos taken in China to demonstrate its effectiveness in both qualitative and quantitative ways.

  • As one of the potential applications, a region recommendation strategy is proposed based on the similarity between region’s characteristics and user’s interest according to his/her photo album.

The rest of the paper is organized as follows. In Sect. 2, we review the related work. Section 3 explains our model and its inference technique. The experimental results are presented in Sect. 4, and we conclude the paper in Sect. 5.

2 Related Work

Much work has been conducted on geographical analysis. Ji et al. [2] propose a hierarchical structure to mine city landmarks from the view, scene, and city layers. The work in [9] analyzes attributes at the region level for region exploration, and [10] addresses urban understanding with CNNs. Livia Hollenstein and Ross S. Purves [11, 12] focus on social media to find out how people form their understanding of a city. Similarly, [1] extracts tags representing landmarks to better present views of a region. In [3, 4], the authors find popular landmarks using mean shift.

This work is also related to applications such as location retrieval and tourist recommendation. The authors of [5] share the viewpoint that users are more interested in a geographic area than in precise GPS coordinates. Our work therefore puts more effort into recommending a suitable geographic area to users than into estimating exact geographic coordinates. The methods in [6, 7] give personalized tourist recommendations based on users’ interests and their similarity, while our work focuses on the similarity between a user’s interest and geographic characteristics.

3 Model

3.1 Geographical Latent Attribute Model

The plate notation of GLAM is illustrated in Fig. 2. Assuming that we have M regions and that region m has \(N_m\) images, we aim to learn the regional attribute distributions \(\{\theta _m\}_{m=1,...,M}\) from these images. We first use GoogLeNet to extract one D-dimensional feature vector \(v_{mn}\) for each image. The problem can then be formalized as learning \(\{\theta _m\}_{m=1,...,M}\) from the feature collection \(\{v_{11},...,v_{MN_M}\}\).

We cast this problem as a generative procedure: each region has a distribution over characteristics, and each characteristic has a distribution over clusters, which are modeled by a Gaussian mixture. Both “characteristic” and “cluster” are introduced as latent variables in this hierarchical structure and can be inferred from the observed variables \(\{v_{11},...,v_{MN_M}\}\). The generative procedure is summarized as follows:

  • Choose regional characteristic proportion \(\theta _{m} \sim Dir(\alpha )\).

  • Choose the characteristic of one image \(i_{mn} \sim Multinomial(\theta _{m})\).

  • Choose the cluster \(z_{mn} \sim Multinomial(\phi _{i_{mn}})\), where \( i_{mn} \in \{ 1,2,...,K \}\).

  • Choose each visual vector \(v_{mn} \sim \mathcal {N}(\mu _{z_{mn}},\sigma _{z_{mn}})\), where \(z_{mn} \in \{1,2,...,K^{\prime }\}\).

Fig. 2. The plate notation of GLAM.

In our model, \(\{(\mu _{k^{\prime }},\sigma _{k^{\prime }})\}_{k^{\prime }=1,...,K^{\prime }}\) constitute the visual space and \(\{\phi _k\}_{k=1,...,K}\) capture the characteristic-cluster distributions. The latent variables \(z_{mn}\) and \(i_{mn}\) are determined by \(v_{mn}\) and in turn affect the regional characteristic distribution \(\theta _m\). In short, we use a topic model structure to learn the high-level concepts at the top layer and a Gaussian mixture model to cluster low-level visual features at the bottom layer.
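The generative procedure can be made concrete with a short simulation. The following is a minimal sketch, not the authors' implementation: it assumes isotropic Gaussian clusters (one scalar variance per cluster), and the function name `sample_region` and all toy sizes are hypothetical.

```python
import numpy as np

def sample_region(n_images, alpha, phi, mu, sigma, rng):
    """Draw one region's images from a GLAM-style generative process.

    alpha: (K,) Dirichlet prior; phi: (K, K') characteristic-to-cluster rows;
    mu: (K', D) cluster means; sigma: (K',) isotropic variances (an assumption).
    """
    K, K_prime = phi.shape
    theta = rng.dirichlet(alpha)                        # theta_m ~ Dir(alpha)
    i = rng.choice(K, size=n_images, p=theta)           # i_mn ~ Multinomial(theta_m)
    # z_mn ~ Multinomial(phi_{i_mn}), drawn image by image
    z = np.array([rng.choice(K_prime, p=phi[k]) for k in i])
    noise = rng.standard_normal((n_images, mu.shape[1]))
    v = mu[z] + np.sqrt(sigma[z])[:, None] * noise      # v_mn ~ N(mu_z, sigma_z * I)
    return theta, i, z, v

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
K, K_prime, D = 3, 5, 4
alpha = np.full(K, 0.5)
phi = rng.dirichlet(np.ones(K_prime), size=K)
mu = rng.normal(size=(K_prime, D))
sigma = np.full(K_prime, 0.1)
theta, i, z, v = sample_region(20, alpha, phi, mu, sigma, rng)
```

In the paper's setting the feature vectors are observed and only the latent quantities are inferred, so a sampler like this is mainly useful for checking an inference routine on synthetic data.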

3.2 Inference and Learning

In this part, we present our inference algorithm. The key inferential problem of our model is to compute the posterior distribution of the latent variables given the data, as in Eq. 1.

$$\begin{aligned} p(\theta ,i,z|\alpha ,\phi ,\mu ,\sigma ,v) = \frac{p(\theta ,i,z,v|\alpha ,\phi ,\mu ,\sigma )}{p(v|\alpha ,\phi ,\mu ,\sigma )} \end{aligned}$$
(1)

The above equation is intractable because the denominator cannot be integrated analytically, so an approximate method, e.g., Gibbs sampling or variational approximation [13], has to be employed. In this paper, we adopt a mean-field variational Bayes method [14] (variational EM). Following its methodology, we define the variational distribution as

$$\begin{aligned} q(\theta ,i,z) = q(\theta |\gamma )q(i|\psi )q(z|\varPhi ), \end{aligned}$$
(2)

where \(\gamma \) is the Dirichlet parameter and \(\psi \), \(\varPhi \) are the multinomial parameters. With this specification, the latent variables could be approximated by minimizing the Kullback-Leibler (KL) divergence between Eqs. 1 and 2.

$$\begin{aligned} {\arg \min }_{(\gamma ,\psi ,\varPhi )}D(q(\theta ,i,z|\gamma ,\psi ,\varPhi )\,\Vert \,p(\theta ,i,z|\alpha ,\phi ,\mu ,\sigma ,v)) \end{aligned}$$
(3)

By setting the derivatives of Eq. 3 with respect to the free parameters \(\gamma \), \(\psi \), \(\varPhi \) to zero, we obtain the following update equations, where i indexes regions, j indexes the images within a region (m and n in Sect. 3.1), and \(\varPsi (\cdot )\) denotes the digamma function.

$$\begin{aligned} \varPhi _{ijk^{\prime }} \propto \exp (\sum _{k}\psi _{ijk}\log {\phi _{kk^{\prime }}})\mathcal {N}(v_{ij}|\mu _{k^{\prime }},\sigma _{k^{\prime }}) \end{aligned}$$
(4)
$$\begin{aligned} \psi _{ijk} \propto \exp (\varPsi (\gamma _{ik}))\exp (\sum _{k^{\prime }}\varPhi _{ijk^{\prime }}\log \phi _{kk^{\prime }}) \end{aligned}$$
(5)
$$\begin{aligned} \gamma _{ik} = \alpha _{k} + \sum _{j}\psi _{ijk} \end{aligned}$$
(6)
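The coordinate updates in Eqs. 4 to 6 can be sketched for a single region as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the log term in Eq. 4 is read as the model parameter \(\phi \), clusters are assumed isotropic with one variance per cluster, the fixed iteration count stands in for a convergence test, and the helper names (`digamma`, `softmax_rows`, `log_gaussian`, `e_step`) are hypothetical.

```python
import numpy as np

def digamma(x):
    """Digamma for a 1-D array of positive values (recurrence + asymptotic series)."""
    x = np.asarray(x, dtype=float).copy()
    result = np.zeros_like(x)
    while np.any(x < 6.0):
        small = x < 6.0
        result[small] -= 1.0 / x[small]
        x[small] += 1.0
    inv2 = 1.0 / (x * x)
    return result + np.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def softmax_rows(log_w):
    """Normalize each row of unnormalized log-weights into a distribution."""
    w = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

def log_gaussian(v, mu, sigma):
    """log N(v_j | mu_k', sigma_k' * I) for all images and clusters; shape (N, K')."""
    D = v.shape[1]
    sq = ((v[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    return -0.5 * (D * np.log(2 * np.pi * sigma)[None, :] + sq / sigma[None, :])

def e_step(v, alpha, phi, mu, sigma, n_iter=50):
    """Iterate the updates for gamma (K,), psi (N, K) and Phi (N, K') of one region."""
    N, K = v.shape[0], alpha.shape[0]
    log_phi = np.log(phi)
    log_lik = log_gaussian(v, mu, sigma)
    psi = np.full((N, K), 1.0 / K)
    gamma = alpha + N / K
    for _ in range(n_iter):
        Phi = softmax_rows(psi @ log_phi + log_lik)                      # Eq. 4
        psi = softmax_rows(digamma(gamma)[None, :] + Phi @ log_phi.T)    # Eq. 5
        gamma = alpha + psi.sum(axis=0)                                  # Eq. 6
    return gamma, psi, Phi

# Toy data, for illustration only.
rng = np.random.default_rng(0)
v = rng.normal(size=(30, 4))
alpha = np.full(3, 0.5)
phi = rng.dirichlet(np.ones(5), size=3)
mu = rng.normal(size=(5, 4))
sigma = np.ones(5)
gamma, psi, Phi = e_step(v, alpha, phi, mu, sigma)
```

Working in log space and normalizing with a row-wise softmax avoids underflow, which matters when D is the dimension of a CNN feature vector.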

The most common approach to estimating the model parameters is to maximize the likelihood of the observed variables, i.e., \(p(v|\alpha ,\phi ,\mu ,\sigma )\). Although this likelihood has no analytical form, Jensen’s inequality can be used to obtain an adjustable lower bound.

$$\begin{aligned}&\ln p(v|\alpha ,\phi ,\mu ,\sigma ) \nonumber \\&= \ln \int \limits _\theta {\sum \limits _{i,z} {p(v,\theta ,i,z|\alpha ,\phi ,\mu ,\sigma )d\theta } } \nonumber \\&= \ln \int \limits _\theta {\sum \limits _{i,z} {\frac{{p(v,\theta ,i,z|\alpha ,\phi ,\mu ,\sigma )q(\theta ,i,z)}}{{q(\theta ,i,z)}}d\theta } } \\&\geqslant {E_q}(\ln p(v,\theta ,i,z|\alpha ,\phi ,\mu ,\sigma )) - {E_q}(\ln q(\theta ,i,z)) \nonumber \\&\triangleq L(\alpha ,\phi ,\mu ,\sigma ) \nonumber \end{aligned}$$
(7)

With the optimal free parameters \(\gamma \), \(\psi \), \(\varPhi \) obtained above, we maximize the lower bound L by setting its derivatives with respect to the parameters \(\phi \), \(\mu \), \(\sigma \) to zero. This gives the following solutions:

$$\begin{aligned} \phi _{kk^{\prime }} \propto \sum _{i}\sum _{j}\psi _{ijk}\varPhi _{ijk^{\prime }} \end{aligned}$$
(8)
$$\begin{aligned} \mu _{k^\prime } = \frac{\sum _i\sum _j\varPhi _{ijk^{\prime }}v_{ij}}{\sum _i\sum _j\varPhi _{ijk^{\prime }}} \end{aligned}$$
(9)
$$\begin{aligned} \sigma _{k^{\prime }} = \frac{\sum _i\sum _j\varPhi _{ijk^{\prime }}(\mu _{k^{\prime }}-v_{ij})^\mathrm {T}(\mu _{k^{\prime }}-v_{ij})}{D\sum _i\sum _j\varPhi _{ijk^{\prime }}} \end{aligned}$$
(10)

For the Dirichlet prior \(\alpha \), we use the Newton-Raphson method to update it, as in LDA [15]. By iterating the inference and parameter estimation procedures, we gradually approach the solution of our model.
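The closed-form updates in Eqs. 8 to 10 translate almost directly into array operations. The sketch below is illustrative only: it assumes the variational parameters of all regions have been stacked along one image axis (so the double sum over i and j becomes a single sum), it omits the Newton-Raphson update of \(\alpha \), and the function name `m_step` is hypothetical.

```python
import numpy as np

def m_step(v, psi, Phi):
    """Closed-form parameter updates from stacked variational parameters.

    v: (T, D) features of all images; psi: (T, K); Phi: (T, K').
    Returns phi (K, K'), mu (K', D) and sigma (K',), one variance per cluster.
    """
    D = v.shape[1]
    phi = psi.T @ Phi                                   # Eq. 8, unnormalized
    phi = phi / phi.sum(axis=1, keepdims=True)          # each phi_k sums to 1 over clusters
    weight = Phi.sum(axis=0)                            # soft image count per cluster
    mu = (Phi.T @ v) / weight[:, None]                  # Eq. 9
    sq = ((v[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
    sigma = (Phi * sq).sum(axis=0) / (D * weight)       # Eq. 10
    return phi, mu, sigma

# Hard assignments make the result easy to check by hand.
v = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [12.0, 12.0]])
Phi = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
psi = Phi.copy()
phi, mu, sigma = m_step(v, psi, Phi)
```

In practice a cluster whose total weight is close to zero would need a small floor to avoid division by zero; that safeguard is left out here for brevity.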

4 Experimental Results

To validate GLAM for geographical analysis, we evaluate it on a Flickr dataset of 2.5M photos in both qualitative and quantitative ways. In addition, we show its potential to retrieve the regions of interest.

Fig. 3. The color map of the data distribution in China. The warmer the color, the more images were taken there. Taiwan has the largest amount of data, while Ningxia has the least. The average amount per province is about 85K.

4.1 Experimental Settings

We crawled 6.5M photos with GPS information from the YFCC100M dataset [16]. Then, using GADM, a database containing the boundary geo-coordinates of each administrative region, we filter out the photos not taken in China; the 2.5M remaining photos are divided into 34 groups according to the administrative regions, as shown in Fig. 3. One feature vector is extracted for each image from the dropout layer (the second-to-last layer) of GoogLeNet [17].

Fig. 4. Region similarity computed with different features. The results of the text feature and our model are quite coherent, whereas with the other features it is difficult to identify similar regions. Presented with \(n=20\), \(K=15\), \(K^{\prime }=500\).

Table 1. Comparing the correlation between ground truth and the three types of features.

4.2 Quantitative Evaluation

In this section, we provide a quantitative evaluation of GLAM, which aims to find a better description of regions based on social images. Since textual content is good at delivering semantic information, we use documents from the online tour guide “TravelChinaGuide”, the largest and most authoritative online tour operator in China, for comparison. Each document covers a general introduction, facts, and even details of daily life for a region. We build topic models with LDA [15] from these textual documents and compute the Euclidean distance between regions based on the learned topic model. Similarly, we compute the distance between regions based on visual features learned by GLAM, by a Gaussian Mixture Model (GMM), and on average visual features extracted directly from GoogLeNet. The corresponding distance matrices are shown in Fig. 4, where brighter colors mean higher similarity. The results of our model are the most similar to those of the textual features, suggesting that our model generates a better semantic description of regions.

To test the effectiveness of our model, we employ Kernel Canonical Correlation Analysis (KCCA) to compute the correlation between the distance matrix obtained from the textual feature and those obtained from the three types of visual features. As shown in Table 1, we learn 5, 10, 15, 20, 25, and 30 topics from the textual feature. Meanwhile, GLAM is trained separately with 200 and 500 clusters, and the number of characteristics K is set to 10, 15, and 20. The distance matrices built from GMM and average visual features correlate only weakly with that of the textual feature, with highest correlations of 0.52 and 0.46, respectively, while the highest correlation for GLAM is 0.82, confirming that GLAM is closer to textual features in terms of semantic region description. This superiority arises because geographical characteristics are abstract and semantic, while GMM and CNN features lack a mechanism to model semantic features, which makes it difficult for them to discover complex patterns.
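The comparison pipeline, from per-region feature vectors to a region-by-region distance matrix to an agreement score, can be illustrated with a simplified sketch. Note the substitution: the paper uses KCCA, whereas the code below computes a plain Pearson correlation over the off-diagonal distances, which is a much simpler stand-in and not the paper's measure. The function names and the random toy features are hypothetical.

```python
import numpy as np

def pairwise_euclidean(features):
    """Region-by-region Euclidean distance matrix from per-region feature rows."""
    diff = features[:, None, :] - features[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def offdiag_pearson(dist_a, dist_b):
    """Pearson correlation between the upper-triangular entries of two distance matrices.

    This is a simple stand-in for KCCA, used here only to show the data flow.
    """
    iu = np.triu_indices_from(dist_a, k=1)
    return np.corrcoef(dist_a[iu], dist_b[iu])[0, 1]

# Toy features for 6 regions; real inputs would be LDA topic proportions
# and GLAM characteristic distributions.
rng = np.random.default_rng(0)
text_topics = rng.dirichlet(np.ones(4), size=6)
visual_theta = rng.dirichlet(np.ones(5), size=6)
d_text = pairwise_euclidean(text_topics)
d_visual = pairwise_euclidean(visual_theta)
score = offdiag_pearson(d_text, d_visual)
```

Only the upper triangle is compared because the distance matrices are symmetric with a zero diagonal, so the remaining entries carry no extra information.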

Fig. 5. Analysis of the regions “Beijing” and “Shanghai”.

4.3 Qualitative Evaluation

We illustrate an example in Fig. 5. A region is described by its dominant characteristics, and each characteristic is described by its top 5 clusters. Here we present only one set of experimental results for qualitative evaluation, where the number of characteristics and the number of clusters are set to 15 and 500, respectively, the setting with the strongest correlation in Table 1. The remaining results can be accessed at: https://sites.google.com/site/geolatentim/.

Take Beijing and Shanghai, two famous cities in China, as an example. As shown in Fig. 5, characteristic 11 dominates Beijing’s characteristic distribution and can be regarded as the main descriptor of Beijing. To interpret this characteristic, we pick out its top 5 representative clusters. We manually summarize these five clusters as Chinese antique, Chinese tower, Chinese architecture, Chinese roof decoration, and pedestrian street, indicating that people in Beijing prefer a traditional Chinese atmosphere. This conclusion fits Beijing well, because Beijing is the national center of Chinese history and culture and historical sites are common there. Similarly, we can see that Shanghai, the economic center of China, is a modern city with a large population, as its characteristics are mainly described by skyscraper, city scene, urban night, modern traffic, and crowded street scene. Among all these regions, it is notable that some are dominated by a single characteristic (e.g., Beijing, Shanghai) while others possess diverse characteristics (e.g., Sichuan, Shandong) for geographical and cultural reasons.

4.4 City Recommendation

In this section, we introduce a strategy for region recommendation based on a user’s photo album. We evaluate the effectiveness of GLAM for recommendation with the Mean Reciprocal Rank (MRR) metric.

A photo collection could reflect a user’s interest since it contains snapshots of things that the user adores. Here we design a strategy based on the similarity between a user’s interest and a region’s geographical characteristics for recommendation. First, we compute an interest distribution \(\theta _{new}\) for a photo collection by Eq. 6. Then, we measure the similarity between this distribution and a region’s characteristics with the following distance metric:

$$\begin{aligned} d_i = ||\theta _i - \theta _{new} ||^{2}, \quad i=1,...,M \end{aligned}$$

where \(\theta _i\) is the characteristic distribution of the \(i^{th}\) region. The smaller the distance, the more similar the collection and the region. The top 3 most similar provinces are returned as the recommendation. In our experiments, we crawled additional photos with GPS information from the Flickr community (not included in our training data) for both quantitative and qualitative evaluation.
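The ranking step can be sketched in a few lines. This is a minimal illustration with made-up numbers: the region names and distributions are hypothetical, and obtaining \(\theta _{new}\) by normalizing the Dirichlet parameter \(\gamma \) of Eq. 6 is an assumption about how the interest distribution is read off, not something the text states explicitly.

```python
import numpy as np

def recommend(theta_regions, theta_new, names, top_k=3):
    """Rank regions by squared Euclidean distance to an album's interest distribution."""
    d = ((theta_regions - theta_new) ** 2).sum(axis=1)   # d_i = ||theta_i - theta_new||^2
    order = np.argsort(d)[:top_k]
    return [(names[i], float(d[i])) for i in order]

# Hypothetical characteristic distributions for four regions (K = 3).
names = ["region_A", "region_B", "region_C", "region_D"]
theta_regions = np.array([
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.4, 0.3, 0.3],
])

# Assumption: theta_new is the normalized variational Dirichlet parameter of the album.
gamma_new = np.array([7.0, 2.0, 1.0])
theta_new = gamma_new / gamma_new.sum()

result = recommend(theta_regions, theta_new, names)
```

With these toy values the album is closest to region_A, followed by region_D and region_B.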

Fig. 6. The recommendation accuracy. The recommendation accuracy increases with the number of input images, and GLAM features outperform GMM and visual features.

For quantitative evaluation, we use the GPS information to choose 100 images from a province to form a virtual album, and the province is regarded as the label of this album. We then feed an increasing number of images from each album, up to 100, and compute the average MRR as the recommendation accuracy. Figure 6 presents the average recommendation accuracy with different parameters. The best average MRR of the GLAM region feature (\(K=15, K^{\prime }=500\)) exceeds 40% when more than 70 images are input; by the definition of MRR, this suggests that the label region appears, on average, within the top 3 recommended regions, which indicates a reliable recommendation result. When the number of input images is small, GLAM, GMM features, and visual features perform similarly. With more input images, however, our model clearly performs better and outperforms the GMM feature and average CNN visual features, because more images better cover the personal characteristics.

For qualitative evaluation, we randomly pick several users and, from each user’s photo collection, randomly select 100 images to form a test photo album. Since the setting of 15 “attributes” and 500 “clusters” provides the best performance (Fig. 6), we employ it here. Figure 7 presents one example: the photo collection contains mostly nature scenes showing mountains and watersides, which indicates that the owner may be a fan of traveling in nature. Our recommendation returns Yunnan, Chongqing, and Jiangxi, which are famous for their landscapes. Browsing the photos from these regions, we observe that the scenery is similar to that of the photo collection.
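For reference, the MRR metric used in the quantitative evaluation can be computed as below. This is a generic sketch of the standard definition; the album rankings and labels are invented for illustration and do not come from the experiments.

```python
def reciprocal_rank(ranked_regions, true_region):
    """Return 1/rank of the true region in a ranked list, or 0.0 if it is absent."""
    for rank, region in enumerate(ranked_regions, start=1):
        if region == true_region:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, labels):
    """Average the reciprocal rank over all albums."""
    total = sum(reciprocal_rank(r, y) for r, y in zip(rankings, labels))
    return total / len(labels)

# Three hypothetical albums whose true region is "A", ranked 1st, 2nd and 3rd.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
labels = ["A", "A", "A"]
mrr = mean_reciprocal_rank(rankings, labels)  # (1 + 1/2 + 1/3) / 3
```

An MRR of 1/3 corresponds to the true region sitting at rank 3 for every album, which is why values above that level are read as the label appearing, on average, near the top of the list.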

Fig. 7. The album recommendation. The recommended regions show natural scenes similar to those of the input album.

5 Conclusion

In this paper, treating “attributes” as descriptors of regional characteristics, we have attempted to find the characteristics most relevant to a region and to use the highly relevant ones to describe it. Representative clusters, formed from social images, are picked out to present the attributes of regions. Experiments on photos taken in China demonstrate, qualitatively and quantitatively, that our model can semantically describe a region from image content. The regional features extracted by our model also benefit the recommendation strategy, which provides reliable results and outperforms GMM features as well as average CNN features in the experiments. Our model is therefore promising for a range of applications and could be further developed in future work related to geographical characteristics.