1 Introduction

With the popularity of smart devices and the support of online sharing platforms, people can share their footprints on the Internet anytime and anywhere (e.g., by checking in at visited locations or uploading geo-tagged photos of them). Users can search this large amount of information manually to find locations that match their preferences, but doing so is usually cumbersome and time-consuming. Location recommendation systems [1] can mine users’ preferences from their history and automatically suggest suitable locations, which is a great convenience for users.

In recent years, location recommendation has attracted the attention of the research community. According to the types of data sources used by the researchers, location recommendation studies can be divided into three categories: 1) full GPS trajectory based location recommendation, 2) check-in based location recommendation, and 3) geo-tagged photo based location recommendation. For the early full GPS trajectory based location recommendation, researchers often use full GPS trajectory data to mine interesting locations and users’ movement patterns. Then, they employ the similarities derived from users’ history to provide personalized location recommendation [2,3,4]. For check-in based location recommendation, researchers usually consider social relationship to provide more appropriate recommendation results [5,6,7,8]. For geo-tagged photo (usually attached with a time-stamp and a coordinate, indicating when and where the photo was taken) based location recommendation, researchers usually first extract travel locations exploiting the geotags of photos, and then take users’ preferences into account to recommend locations [9,10,11].

To improve recommendation performance, different additional information (e.g., sequential [12,13,14], textual [7, 15,16,17], geographical [5, 18, 19], and visual [20,21,22,23] information) has been introduced into location recommendation methods. Existing studies can be divided into two categories. The former learns features from additional information and users’ history simultaneously to make personalized recommendations [24, 25], which usually requires many training examples to learn good representations of users and locations. The latter first extracts features from additional information and then trains a recommendation model on users’ history by using these features as priors [26, 27], which can integrate different additional information flexibly and achieve comparable performance with fewer training examples. However, no existing method exploits all of the above-mentioned additional information, and the significance of different additional information is not well studied.

To address the above-mentioned problems, we propose Weighted Multi-Information Constrained Matrix Factorization (WIND-MF) for personalized travel location recommendation based on geo-tagged photos. We first exploit multi-information to profile users and travel locations, and then assign different weights to the corresponding similarities for personalized travel location recommendation.

The main contributions of this paper are as follows:

  1) Propose WIND-MF for personalized travel location recommendation based on geo-tagged photos, which can exploit photos, users’ visit sequences, and textual tags, to profile users and locations comprehensively.

  2) Assign different weights to visual, sequential, and textual similarities as well as co-visit probabilities, which can capture the significance of different additional information.

  3) Conduct extensive experiments to study the impact of different additional information. The results reveal that visual and sequential information contributes most to improving recommendation performance.

The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 gives the definitions of some basic concepts and then formally defines the problem. Section 4 introduces the proposed method. Section 5 presents the experiment results, and Section 6 concludes the paper and gives a brief outlook on future work.

2 Related work

In this section, we review existing studies related to our work, including location recommendation exploiting different types of data sources and location recommendation exploiting different additional information.

2.1 Location recommendation exploiting different types of data sources

Early location recommendation studies are mostly based on full GPS trajectories [2,3,4, 28]. Zheng et al. [2, 3] leveraged GPS trajectory data recorded by multiple users to mine and recommend interesting locations as well as classical sequences within a given geospatial region. Due to privacy issues, it is difficult for the researchers to obtain a large amount of full GPS trajectory data [29]. Therefore, researchers seek to extract partial GPS trajectories from other easily accessed data sources, e.g., check-ins and geo-tagged photos.

Nowadays, with the development of Location-Based Social Networks (LBSNs) [29], people can share their locations with others and comment on the locations they have visited, which has given rise to check-in based location recommendation. Early studies mainly take social relationships into account to provide more appropriate recommendation results [30, 31]. Long and Joshi [31] proposed a Hypertext Induced Topic Search (HITS) based Point of Interest (POI) recommendation algorithm, which recommends POIs based on users’ social network links and check-in behaviour. Gao et al. [30] proposed a social-historical model to explore users’ check-in behaviour, which integrates social and historical effects, and assesses the role of social correlation in users’ check-in behaviour. To improve recommendation performance, different additional information (e.g., sequential, textual, geographical, and visual information) has been introduced, which we review in the following subsection.

Geo-tagged photos provide rich location-based data, which can be exploited for location recommendation. Early studies mainly take users’ preferences into account to recommend locations [9,10,11]. Clements et al. [9] firstly defined a similarity between the geotag distributions of two users based on a Gaussian kernel convolution, and predicted a user’s favourite locations in a new city based on the rankings of the most similar users in the target city. Mamei et al. [10] proposed to recommend interesting locations in a new city to a user by exploiting instance-based Pearson collaborative filtering (CF). Popescu and Grefenstette [11] firstly mined users’ location visiting records to build a user-user similarity matrix, and recommended interesting locations in a new destination based on the experience of like-minded users who already visited that destination. To improve recommendation performance, different additional information (e.g., sequential, textual, geographical, and visual information) has been introduced, which we will introduce in the following subsection.

2.2 Location recommendation exploiting different additional information

Sequential information

Since human movement has strong sequential patterns, i.e., the probability of the transition from one location to another is non-uniform [32], researchers tend to exploit sequential information. Markov chain models, Recurrent Neural Networks (RNNs), and word2vec have been used to model location sequences in the previous works. Kurashima et al. [14] used a first-order Markov chain model to predict the visit probability of a next location dependent on the previous location. Yang et al. [12] employed RNN and Gated Recurrent Unit (GRU) models to capture short and long term sequential information in mobile trajectories, respectively. Liu et al. [13] leveraged word2vec to learn the latent representation of a location by capturing the influence of its context locations. Inspired by the success of word2vec to capture sequential information, we exploit doc2vec [33] (an extension of word2vec, which can learn the vectors of documents) to get the sequential representations of users and locations.

Textual information

Since locations visited by the same user tend to be semantically similar [34], researchers tend to exploit textual information. Term Frequency-Inverse Document Frequency (TF-IDF), topic models, and sentiments have been used to capture textual information in the previous works. Majid et al. [15, 16] and Memon et al. [35] leveraged TF-IDF to process the textual tags of geo-tagged photos. Jiang et al. [17] extracted the topics of users’ preferences from the textual tags of geo-tagged photos via an author topic model. Gao et al. [7] studied the relationship between users’ check-in behaviour and sentiment indications extracted from their tips. Inspired by the success of topic models to capture textual information, we exploit Latent Dirichlet Allocation (LDA) [36] to get the textual representations of users and locations.

Geographical information

Since there is a strong correlation between users’ check-in behaviour and geographical distance, researchers tend to exploit geographical information. One major research direction assumes that the distances of visited locations follow a power law distribution derived from the whole check-in history of all users. Yuan et al. [18] and Ye et al. [5] fused CF with geographical information modeled by a power law distribution for POI recommendation. Another research direction firstly clusters the whole check-in history of all users to find the most popular locations as centers, and assumes that the distances between visited locations and their centers follow a multi-center Gaussian model. Cheng et al. [19] fused geographical information modeled by a multi-center Gaussian model into a generalized matrix factorization framework for POI recommendation. Inspired by the success of the power law distribution to capture geographical information, we exploit it to model the co-visit probabilities of locations.

Visual information

In addition to the above well-studied information, researchers have attempted to exploit visual information. Originally, users’ attributes extracted via face recognition have been exploited to build users’ profiles. Cheng et al. [22] extracted users’ attributes (e.g., gender, age, and race) from photo contents, which was extended by further considering travel group types (e.g., family, friends, and couple) [23]. Recently, deep neural networks have been exploited to extract visual features. Wang et al. [21] used a VGG16 model to extract visual features from images and incorporated them into a probabilistic model for learning the latent features of users and POIs. Zhang et al. [37] used a Bayesian stacked convolutional auto-encoder to extract visual features from images and incorporated them into a pair-wise ranking model. Inspired by the success of auto-encoders to capture visual information, we exploit Variational Auto-Encoder (VAE) [38] to get the visual representations of users and locations.

Multi-information

Recently, researchers also have exploited multi-information for location recommendation. Existing studies can be divided into two categories. The former learns features from additional information and users’ history simultaneously to make personalized recommendation [24, 25], which usually needs lots of training examples to learn good representations of users and locations. Zhou et al. [25] proposed a multi-context trajectory embedding framework, which embeds user-level, trajectory-level, location-level, and temporal contexts into a shared low-dimension space. Yang et al. [24] developed PACE, which jointly predicts user’s preferences over POIs, user’s friends, and POI’s nearby POIs, and ensures that users or POIs sharing more similar friends or nearby POIs will have closer embeddings.

The latter first extracts features from additional information and then trains a recommendation model on users’ history by using these features as priors [26, 27], which can integrate different additional information flexibly and achieve comparable performance with fewer training examples. Xu et al. [27] firstly exploited gender and age information extracted from photos, POI category distribution, temporally fine-grained users’ preferences, etc., to profile users and travel locations, and then calculated user-user and travel location-travel location similarities to constrain the factorization of user-travel location matrix. Ding and Chen [39] proposed RecNet, which firstly factorizes co-visiting, geographical proximity, and categorical correlation matrices to obtain the embeddings of POIs and users, and then feeds the embedded POIs and users into a deep neural network to adaptively learn high-order interactions between them. However, no existing method exploits all of the above-mentioned additional information, and the significance of different additional information is not well studied.

3 Preliminaries and problem definition

In this section, we give the definitions of some basic concepts and terms, and then formally define the problem.

Definition 1

(Geo-tagged photo) A geo-tagged photo can be defined as p = (u, g, t, T), where u is the user who contributed the photo, g is the coordinate where the photo was taken, t is the time-stamp when the photo was taken, and T is the textual tag set of the photo.

Definition 2

(User) A user is a person who has taken and shared geo-tagged photos online, and can be defined as u = ( P, T), where P is the set of geo-tagged photos taken by the user, and T is the set of textual tags belonging to all the photos taken by the user. A set of users can be defined as U = {u1, u2, ⋯, u|U|}.

Definition 3

(Travel location) A travel location can be defined as l = (c, g, P, T), where c is the city the location is in, g is its coordinate, P is the set of geo-tagged photos taken at the location, and T is the set of textual tags belonging to all the photos taken at the location. A set of travel locations can be defined as L = {l1, l2, ⋯, l|L|}.

Definition 4

(Visit) A visit can be defined as v = (u, l, t), which denotes that user u has visited travel location l at time t.

Definition 5

(User-travel location matrix) User-travel location matrix can be defined as M, whose element Mij (0 < i ≤ |U| ∧ 0 < j ≤ |L|) represents the visit frequency of user ui to travel location lj.

Definition 6

(Visit sequence). Given time threshold ∆T, the j th visit sequence of user ui is \( {S}_i^j=\left[\left({l}_1,{t}_1\right),\left({l}_2,{t}_2\right),\cdots, \left({l}_{\left|{S}_i^j\right|},{t}_{\left|{S}_i^j\right|}\right)\right] \), where \( 0<{t}_{k+1}-{t}_k\le \Delta T\ \left(1\le k<\left|{S}_i^j\right|\right)\wedge \left|{S}_i^j\right|\ge 2 \). \( {SS}_i=\left\{{S}_i^1,{S}_i^2,\cdots, {S}_i^{\left|{SS}_i\right|}\right\} \) denotes the visit sequence set of the user.

Our research problem can be formulated as: Given target user u and query city c where the user wants to travel for the first time, i.e., the query is q = (u, c), our purpose is to recommend a set of travel locations in city c that user u would be interested in.

4 Method

The framework of the proposed method is shown in Fig. 1. First, the geo-tagged photos are clustered to find travel locations. Based on the visits identified from consecutive photos taken within a threshold time period by the same user at the same travel location, we build the original user-travel location matrix M and extract users’ visit sequences. After that, we profile users and travel locations by exploiting the visual, sequential, and textual information of geo-tagged photos via VAE, doc2vec, and LDA, respectively. In addition, the co-visit probabilities of travel locations are modeled by a power-law distribution based on their geographical distances. Afterwards, different weights are assigned to visual, sequential, and textual similarities as well as co-visit probabilities to obtain weighted user-user and travel location-travel location similarities, which are then used as regularization terms to constrain the factorization of M. Finally, we get the completed user-travel location matrix R, according to which we can recommend suitable travel locations to the target user.

Fig. 1 The framework of the proposed method

4.1 Find travel locations

Since people usually take a lot of photos at travel locations, finding travel locations can be regarded as a problem of recognizing frequently photographed places. Researchers have used clustering algorithms to extract travel locations, e.g., mean shift [14, 22, 23], OPTICS [2], and DBSCAN [40, 41]. Generally speaking, DBSCAN has the following advantages over other clustering algorithms: 1) it needs minimal domain knowledge to determine its parameters (the number of clusters does not have to be specified in advance) and can identify clusters of arbitrary shapes, 2) it can filter out outliers, and 3) it remains efficient on large-scale data. However, DBSCAN fails to meet our requirements, as it applies a uniform density threshold to all clusters, while the clusters that we would like to extract from geo-tagged photos may have different sizes and densities. To solve this problem, Kisilevich et al. [42] proposed the P-DBSCAN algorithm, which extends the definition of directly density-reachable by using adaptive density. Therefore, we use P-DBSCAN to find travel locations from geo-tagged photos, i.e., to obtain a set of travel locations L.

4.2 Build the original matrix and extract visit sequences

The visit frequency of a user to a travel location indirectly reflects how much the user prefers the location. Therefore, we first obtain the visit frequency of each user-travel location pair. Following [43], we use the geo-tagged photos taken by different users at different travel locations to identify each visit. Specifically, given a user-travel location pair, we first sort the geo-tagged photos by the time they were taken. Since a user might take more than one geo-tagged photo within one visit, if the time interval between two consecutive photos is less than the visit duration threshold tthr, we presume that these two photos belong to the same visit. The mean capture time of the photos belonging to the same visit is regarded as the visit time. After processing all users’ visit history, we count the visit frequency of each user-travel location pair to get the original user-travel location matrix M.
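The visit identification step can be illustrated with a minimal Python sketch. The tuple layout `(user, location, timestamp)`, the function names, and the toy data are hypothetical and chosen only for illustration; they are not taken from the implementation used in this paper.

```python
from collections import defaultdict

def identify_visits(photos, t_thr):
    """Group photos into visits.

    photos: iterable of (user, location, timestamp_in_seconds) tuples
            (a simplified, hypothetical form of a geo-tagged photo).
    t_thr:  visit duration threshold in seconds.
    Returns a list of (user, location, visit_time) tuples.
    """
    by_pair = defaultdict(list)
    for user, loc, ts in photos:
        by_pair[(user, loc)].append(ts)

    visits = []
    for (user, loc), stamps in by_pair.items():
        stamps.sort()
        group = [stamps[0]]
        for ts in stamps[1:]:
            if ts - group[-1] < t_thr:   # gap below threshold: same visit
                group.append(ts)
            else:                        # gap too large: close the visit
                visits.append((user, loc, sum(group) / len(group)))
                group = [ts]
        visits.append((user, loc, sum(group) / len(group)))
    return visits

def visit_frequency(visits):
    """Count visits per (user, location) pair: the non-zero entries of M."""
    freq = defaultdict(int)
    for user, loc, _ in visits:
        freq[(user, loc)] += 1
    return dict(freq)

HOUR = 3600
photos = [
    ("u1", "l1", 0), ("u1", "l1", 1 * HOUR),  # two photos, one visit
    ("u1", "l1", 10 * HOUR),                   # a second visit to l1
    ("u1", "l2", 12 * HOUR),
]
visits = identify_visits(photos, t_thr=6 * HOUR)
M = visit_frequency(visits)
```

In this toy example the first two photos at l1 are merged into one visit whose time is the mean of their timestamps, while the third photo, taken nine hours later, starts a new visit.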

Considering that users’ visit order of travel locations usually can reflect their travel preferences to a certain extent, we extract users’ visit sequences based on their visit history. Specifically, we firstly sort the visit history of user ui according to visit time. Then, according to time threshold ∆T, we segment ui’s visit history to obtain the visit sequence set of the user, i.e., SSi.
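A possible way to segment a visit history according to Definition 6 is sketched below. The function name and the `(location, time)` tuple layout are assumptions made for this example. Two visits with identical times are placed in different sequences, which follows the strict inequality in Definition 6.

```python
def split_visit_sequences(visits, delta_t):
    """Segment one user's visits into visit sequences.

    visits:  list of (location, time) pairs for a single user.
    delta_t: time threshold; consecutive visits stay in the same sequence
             only if 0 < gap <= delta_t.
    Sequences with fewer than two visits are discarded (Definition 6).
    """
    ordered = sorted(visits, key=lambda v: v[1])
    sequences, current = [], []
    for visit in ordered:
        if current and 0 < visit[1] - current[-1][1] <= delta_t:
            current.append(visit)
        else:
            if len(current) >= 2:
                sequences.append(current)
            current = [visit]
    if len(current) >= 2:
        sequences.append(current)
    return sequences

history = [("l3", 30), ("l1", 0), ("l2", 5), ("l4", 33), ("l5", 100)]
SS = split_visit_sequences(history, delta_t=8)
```

Here the visit to l5 is isolated in time, so it forms no sequence of length two and is dropped.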

4.3 Profile users and travel locations

In this section, we consider visual, sequential, and textual information to profile users and travel locations.

4.3.1 Visual information modeling

Photos taken by users at travel locations contain a large amount of visual information. In order to make full use of it, we use VAE to get important visual characteristics from photos.

VAE [38] is an unsupervised learning method to learn complicated distributions of data, whose loss function is given by Eqs. 1–4. Given all the geo-tagged photos \( \boldsymbol{X}\in {\mathbb{R}}^{r\times 32\times 32\times 3} \), where r is the number of photos, we assume that X is generated by a directed graphical model P(X| z), and the encoder learns an approximation Q(z| X) to the posterior distribution P(z| X).

$$ {L}_{\mathrm{VAE}}={E}_{\boldsymbol{z}\sim Q}\left[\log P\left(\boldsymbol{X}|\boldsymbol{z}\right)\right]-{D}_{KL}\left(Q\left(\boldsymbol{z}|\boldsymbol{X}\right)\Big\Vert P\left(\boldsymbol{z}\right)\right) $$
(1)
$$ P\left(\boldsymbol{z}\right)=N\left(\boldsymbol{z}|\mathbf{0},\boldsymbol{I}\right) $$
(2)
$$ Q\left(\boldsymbol{z}|\boldsymbol{X}\right)=N\left(\boldsymbol{z}|\boldsymbol{\mu}, {\boldsymbol{\sigma}}^2\ast \boldsymbol{I}\right) $$
(3)
$$ P\left(\boldsymbol{X}|\boldsymbol{z}\right)=N\left(\boldsymbol{X}|{\boldsymbol{\mu}}^{\prime },{{\boldsymbol{\sigma}}^{\prime}}^2\ast \boldsymbol{I}\right) $$
(4)

where the latent variable z follows a normal distribution and is viewed as the visual features of photos X. Specifically, the visual feature of photo Xk (0 ≤ k < r) is zk. I is an identity matrix. Q(z| X) gives a distribution over z values that are likely to produce X. P(X| z) measures the amount of information required to reconstruct X from z under an ideal encoding. μ, σ, μ′, and σ′ are arbitrary deterministic functions that can be learned from data, and are implemented via neural networks.
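For the diagonal Gaussians in Eqs. 2–4, both terms of Eq. 1 have closed forms, which the following sketch evaluates for a single photo. Eq. 1 as written is a lower bound to be maximized, so a training loss would typically be its negative. The function names are hypothetical, and the encoder and decoder networks that produce μ, σ, μ′, and σ′ are not modelled here: their outputs are simply passed in as lists.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + s * s - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def gaussian_log_likelihood(x, mu, sigma):
    """log N(x | mu, diag(sigma^2)) for flattened pixel values x."""
    return -0.5 * sum(math.log(2.0 * math.pi * s * s) + ((xi - m) / s) ** 2
                      for xi, m, s in zip(x, mu, sigma))

def elbo_estimate(x, enc_mu, enc_sigma, dec_mu, dec_sigma):
    """Single-sample estimate of Eq. 1 for one photo.

    dec_mu and dec_sigma are assumed to be the decoder outputs for one z
    drawn from Q(z|X).
    """
    return (gaussian_log_likelihood(x, dec_mu, dec_sigma)
            - kl_to_standard_normal(enc_mu, enc_sigma))
```

The KL term vanishes when the encoder outputs μ = 0 and σ = 1, i.e., when the approximate posterior equals the prior of Eq. 2.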

The visual features of all the photos taken at travel location lj are averaged to get the visual representation of the travel location, denoted as \( {\boldsymbol{v}}_{l_j} \); meanwhile, the visual features of all the photos taken by user ui are averaged to get the visual representation of the user, denoted as \( {\boldsymbol{v}}_{u_i} \).
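The averaging step can be sketched as follows, assuming each photo's latent feature is already available as a list of floats together with its user and travel location. The names and the toy two-dimensional features are illustrative only.

```python
from collections import defaultdict

def mean_vector(vectors):
    """Component-wise average of equally sized feature vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def visual_representations(photo_features):
    """photo_features: list of (user, location, z), z being the photo's feature."""
    by_user, by_location = defaultdict(list), defaultdict(list)
    for user, location, z in photo_features:
        by_user[user].append(z)
        by_location[location].append(z)
    v_user = {u: mean_vector(zs) for u, zs in by_user.items()}
    v_location = {l: mean_vector(zs) for l, zs in by_location.items()}
    return v_user, v_location

features = [
    ("u1", "l1", [1.0, 0.0]),
    ("u1", "l2", [0.0, 1.0]),
    ("u2", "l1", [3.0, 2.0]),
]
v_user, v_location = visual_representations(features)
```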

4.3.2 Sequential information modeling

Since the order in which users visit travel locations can reflect their travel preferences to a certain extent, we use doc2vec to get the sequential representations of users and travel locations.

Doc2vec [33] was proposed to learn the vector of variable length text (e.g., sentences, paragraphs, and documents) and is based on word2vec [44]. It can be divided into Distributed Memory version of Paragraph Vector (PV-DM) and Distributed Bag of Words version of Paragraph Vector (PV-DBOW). Since PV-DM considers the concatenation of the document vector with the word vectors of context words to predict the next word in a text window, while PV-DBOW ignores context words, we use PV-DM.

An illustrative framework of PV-DM is shown in Fig. 2. Treating user ui’s visit sequence set SSi as a document, PV-DM predicts the next travel location lj by considering the user and context travel locations lj − w to lj + w, whose objective function is to maximize the average log probability in Eqs. 5–7:

$$ L\left({SS}_i\right)=\frac{1}{\left|{SS}_i\right|}\sum \limits_{l_j\in {SS}_i}p\left({l}_j|{u}_i,{l}_{j-w},\cdots, {l}_k\cdots, {l}_{j+w}\right) $$
(5)
$$ p\left({l}_j|{u}_i,{l}_{j-w},\cdots, {l}_k\cdots, {l}_{j+w}\right)=\frac{\exp \left({{\boldsymbol{s}}_{l_j}}^{\mathrm{T}}\bullet \boldsymbol{v}\right)}{\sum \limits_{l_{j^{\prime }}\in L}\exp \left({{\boldsymbol{s}}_{l_{j^{\prime}}}}^{\mathrm{T}}\bullet \boldsymbol{v}\right)} $$
(6)
$$ \boldsymbol{v}=\left({\boldsymbol{s}}_{u_i}+{\boldsymbol{s}}_{l_{j-w}}+\mathbf{\cdots}+{\boldsymbol{s}}_{l_k}\mathbf{\cdots}+{\boldsymbol{s}}_{l_{j+w}}\right)/\left(2w+1\right) $$
(7)

where w is the size of the context window. \( {l}_{j^{\prime }}\ \left({l}_j\ne {l}_{j^{\prime }}\right) \) is one of the travel locations in the travel location set, and lk (j − w ≤ k ≤ j + w, k ≠ j) is one of the context travel locations. \( {\boldsymbol{s}}_{l_j} \), \( {\boldsymbol{s}}_{l_{j^{\prime }}} \), and \( {\boldsymbol{s}}_{l_k} \) are the embeddings of travel locations lj, \( {l}_{j^{\prime }} \), and lk, respectively. \( {\boldsymbol{s}}_{u_i} \) is the embedding of user ui.
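To make Eqs. 6 and 7 concrete, the sketch below computes the softmax probability of a target travel location from the averaged user and context embeddings. It is a toy illustration with hand-picked two-dimensional vectors, not a training procedure, and all names are hypothetical. Two simplifications are assumed: the average divides by the number of vectors actually supplied (which equals 2w + 1 for a full window), and the same embedding is used on the input and output sides, as Eq. 6 is written.

```python
import math

def pv_dm_probability(target, user_vec, context_vecs, location_vecs):
    """Probability of `target` given the user and context embeddings."""
    # Eq. 7: average the user embedding and the context location embeddings.
    stacked = [user_vec] + list(context_vecs)
    v = [sum(component) / len(stacked) for component in zip(*stacked)]
    # Eq. 6: softmax over the inner products with every location embedding.
    scores = {loc: sum(a * b for a, b in zip(vec, v))
              for loc, vec in location_vecs.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    weights = {loc: math.exp(score - top) for loc, score in scores.items()}
    return weights[target] / sum(weights.values())

location_vecs = {"l1": [0.2, 0.1], "l2": [0.4, -0.3], "l3": [-0.1, 0.5]}
user_vec = [0.3, 0.3]
context = [location_vecs["l1"], location_vecs["l3"]]  # w = 1 around target l2
p = {loc: pv_dm_probability(loc, user_vec, context, location_vecs)
     for loc in location_vecs}
```

The values in `p` form a probability distribution over the travel location set, so they sum to one.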

Fig. 2 The framework of PV-DM

4.3.3 Textual information modeling

In addition to the rich visual information contained in geo-tagged photos, their corresponding metadata also hold a large amount of textual information, which is also very important to profile users and travel locations [14, 16].

LDA is a representative topic model [45], which was first proposed and applied in the field of natural language processing. In LDA, each document is represented as a probability distribution over topics, and each word’s presence is attributable to one of the document’s topics.

Treating a textual tag as a word, and the textual tag set of a user as a document, the generation process is as follows (as shown in Fig. 3):

  1) For i ∈ [1, k]:

    Sample φi~Dir(β), where φi is the word distribution of the i th topic, k is the number of topics, and Dir(β) is the Dirichlet distribution of parameter β.

  2) For i ∈ [1, |U|]:

    Sample \( {\boldsymbol{\theta}}_{u_i}\sim \mathrm{Dir}\left(\boldsymbol{\alpha} \right) \), where \( {\boldsymbol{\theta}}_{u_i} \) is the topic distribution of user ui’s textual tag set, and Dir(α) is the Dirichlet distribution of parameter α.

  3) For each textual tag in ui’s textual tag set:

    Sample topic \( z\sim \mathrm{Multinomial}\left({\boldsymbol{\theta}}_{u_i}\right) \).

    Sample word w~Multinomial(φz).

Fig. 3 The graph model of LDA topic model

Similarly, treating a textual tag as a word, and the textual tag set of a travel location as a document, we can obtain the topic distributions of travel locations. Specifically, the topic distribution of travel location lj’s textual tag set is \( {\boldsymbol{\theta}}_{l_j} \).
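The generation process described above can be simulated with the Python standard library, as sketched below. This only illustrates the generative story; it does not perform the inference that recovers the topic distributions from observed tags. Symmetric Dirichlet priors (a single scalar for α and for β), the toy vocabulary, and all function names are assumptions of this example.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from Dir(concentration) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]

def generate_tag_sets(vocabulary, k, tags_per_user, alpha, beta, rng):
    # Step 1: one word distribution phi_i per topic.
    phi = [sample_dirichlet([beta] * len(vocabulary), rng) for _ in range(k)]
    thetas, tag_sets = [], []
    for n_tags in tags_per_user:
        # Step 2: topic distribution theta of this user's tag set.
        theta = sample_dirichlet([alpha] * k, rng)
        tags = []
        for _ in range(n_tags):
            # Step 3: pick a topic, then a tag from that topic.
            z = rng.choices(range(k), weights=theta)[0]
            tags.append(rng.choices(vocabulary, weights=phi[z])[0])
        thetas.append(theta)
        tag_sets.append(tags)
    return phi, thetas, tag_sets

rng = random.Random(0)
vocabulary = ["lake", "temple", "museum", "garden"]  # toy tags
phi, thetas, tag_sets = generate_tag_sets(
    vocabulary, k=2, tags_per_user=[5, 3], alpha=1.0, beta=0.5, rng=rng)
```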

4.4 Calculate user similarity and travel location similarity

We use cosine similarity, which is one of the most popular similarity measures [27, 46, 47], to measure user-user and travel location-travel location similarities based on visual, sequential, and textual information. For example, visual similarity between two users ui and uj can be calculated by Eq. 8:

$$ \mathrm{Sim}\left({\boldsymbol{v}}_{u_i},{\boldsymbol{v}}_{u_j}\right)=\frac{{\boldsymbol{v}}_{u_i}\bullet {\boldsymbol{v}}_{u_j}}{{\left\Vert {\boldsymbol{v}}_{u_i}\right\Vert}_2\times {\left\Vert {\boldsymbol{v}}_{u_j}\right\Vert}_2} $$
(8)

where \( {\left\Vert {\boldsymbol{v}}_{u_i}\right\Vert}_2 \) and \( {\left\Vert {\boldsymbol{v}}_{u_j}\right\Vert}_2 \) represent the Euclidean norms of \( {\boldsymbol{v}}_{u_i} \) and \( {\boldsymbol{v}}_{u_j} \), respectively. Similarly, sequential and textual similarities between ui and uj, as well as visual, sequential, and textual similarities between two travel locations li and lj can be calculated, denoted by \( \mathrm{Sim}\left({\boldsymbol{s}}_{u_i},{\boldsymbol{s}}_{u_j}\right) \), \( \mathrm{Sim}\left({\boldsymbol{\theta}}_{u_i},{\boldsymbol{\theta}}_{u_j}\right) \), \( \mathrm{Sim}\left({\boldsymbol{v}}_{l_i},{\boldsymbol{v}}_{l_j}\right) \), \( \mathrm{Sim}\left({\boldsymbol{s}}_{l_i},{\boldsymbol{s}}_{l_j}\right) \), and \( \mathrm{Sim}\left({\boldsymbol{\theta}}_{l_i},{\boldsymbol{\theta}}_{l_j}\right) \), respectively.
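As a small illustration, cosine similarity in its standard form (the inner product divided by the product of the Euclidean norms) can be computed as follows. Returning 0 for an all-zero vector is a convention assumed here, since the measure is undefined in that case.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equally sized vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_a * norm_b)
```

The same function applies unchanged to the visual, sequential, and textual representations, because each of them is a fixed-length vector.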

In addition, based on the coordinates, the geographical distance between li and lj can be obtained, denoted as dis(li, lj). Then, we use a power law distribution [5] to model the co-visit probability of li and lj , which is given by Eq. 9:

$$ y\left({l}_i,{l}_j\right)=a\times \mathrm{dis}{\left({l}_i,{l}_j\right)}^b $$
(9)

where a and b are the parameters of the power-law distribution, which can be learned by linear regression after taking the logarithm of both sides of Eq. 9, i.e., log y = log a + b × log dis.
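One common way to estimate a and b, assumed here for illustration, is an ordinary least-squares fit of log y against log dis. The function name and the synthetic data below are hypothetical; in practice the inputs would be distances and empirical co-visit probabilities derived from the visit history.

```python
import math

def fit_power_law(distances, probabilities):
    """Fit y = a * dis^b by least squares on log y = log a + b * log dis."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(p) for p in probabilities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

# Synthetic, noise-free data generated from a = 0.5 and b = -1.2.
distances = [1.0, 2.0, 4.0, 8.0]
probabilities = [0.5 * d ** -1.2 for d in distances]
a, b = fit_power_law(distances, probabilities)
```

Distances and probabilities must be strictly positive for the logarithms to be defined, so pairs with zero distance or zero observed probability would have to be excluded or smoothed beforehand.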

Finally, weighted user-user and travel location-travel location similarities can be calculated by Eq. 10 and Eq. 11, respectively:

$$ \mathrm{Sim}\mathrm{U}\left({u}_i,{u}_j\right)={w}_1\times \mathrm{Sim}\left({\boldsymbol{v}}_{u_i},{\boldsymbol{v}}_{u_j}\right)+{w}_2\times \mathrm{Sim}\left({\boldsymbol{s}}_{u_i},{\boldsymbol{s}}_{u_j}\right)+\left(1-{w}_1-{w}_2\right)\times \mathrm{Sim}\left({\boldsymbol{\theta}}_{u_i},{\boldsymbol{\theta}}_{u_j}\right) $$
(10)
$$ \mathrm{Sim}\mathrm{L}\left({l}_i,{l}_j\right)={w}_3\times \mathrm{Sim}\left({\boldsymbol{v}}_{l_i},{\boldsymbol{v}}_{l_j}\right)+{w}_4\times \mathrm{Sim}\left({\boldsymbol{s}}_{l_i},{\boldsymbol{s}}_{l_j}\right)+{w}_5\times \mathrm{Sim}\left({\boldsymbol{\theta}}_{l_i},{\boldsymbol{\theta}}_{l_j}\right)+\left(1-{w}_3-{w}_4-{w}_5\right)\times y\left({l}_i,{l}_j\right) $$
(11)

where w1, w2, w3, w4, and w5 are similarity weights.
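Eqs. 10 and 11 are convex-style combinations whose last weight is determined by the others. The sketch below evaluates them for hypothetical similarity values, using the weights reported as optimal in Section 5.2; the function names are illustrative.

```python
def weighted_user_similarity(sim_v, sim_s, sim_t, w1, w2):
    """Eq. 10: visual, sequential, and textual similarities of two users."""
    return w1 * sim_v + w2 * sim_s + (1.0 - w1 - w2) * sim_t

def weighted_location_similarity(sim_v, sim_s, sim_t, co_visit, w3, w4, w5):
    """Eq. 11: the remaining weight goes to the co-visit probability."""
    return (w3 * sim_v + w4 * sim_s + w5 * sim_t
            + (1.0 - w3 - w4 - w5) * co_visit)

# Hypothetical similarity values for one pair of users and one pair of locations.
sim_u = weighted_user_similarity(0.8, 0.2, 0.5, w1=0.5, w2=0.1)
sim_l = weighted_location_similarity(0.8, 0.2, 0.5, 0.1, w3=0.4, w4=0.2, w5=0.3)
```

With w1 = 0.5 and w2 = 0.1 the textual similarity of users receives the remaining weight 0.4, and with w3 = 0.4, w4 = 0.2, and w5 = 0.3 the co-visit probability receives 0.1.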

4.5 Factorize the original matrix

Matrix factorization has been successfully used in recommender systems [48,49,50,51]. It can find the latent variables between users and items, which can reflect the characteristics of users and items to a certain extent. Through matrix factorization, the original user-travel location matrix M can be approximated by the product of two factorized matrices U and L, which is given by Eq. 12.

$$ \boldsymbol{M}\approx {\boldsymbol{U}}^{\mathrm{T}}\boldsymbol{L}=\boldsymbol{R} $$
(12)

The objective function of matrix factorization is given by Eq. 13.

$$ V=\underset{\boldsymbol{U},\boldsymbol{L}}{\min}\frac{1}{2}\sum \limits_{0<i\le \left|U\right|,0<j\le \left|L\right|}{\left({\boldsymbol{M}}_{ij}-{\boldsymbol{U}}_i^{\mathrm{T}}{\boldsymbol{L}}_j\right)}^2+\frac{\lambda }{2}\left({\left\Vert \boldsymbol{U}\right\Vert}_F^2+{\left\Vert \boldsymbol{L}\right\Vert}_F^2\right) $$
(13)

where \( {\left\Vert \boldsymbol{U}\right\Vert}_F^2 \) and \( {\left\Vert \boldsymbol{L}\right\Vert}_F^2 \) represent the squared Frobenius norms of matrices U and L, respectively, and λ is a regularization parameter that is used to prevent over-fitting.

Since visit frequency is a kind of implicit feedback, we exploit weighted matrix factorization [52], whose objective function is given by Eq. 14:

$$ V=\underset{\boldsymbol{U},\boldsymbol{L}}{\min}\frac{1}{2}\sum \limits_{0<i\le \left|U\right|,0<j\le \left|L\right|}{\boldsymbol{C}}_{ij}{\left({\boldsymbol{P}}_{ij}-{\boldsymbol{U}}_i^{\mathrm{T}}{\boldsymbol{L}}_j\right)}^2+\frac{\lambda }{2}\left({\left\Vert \boldsymbol{U}\right\Vert}_F^2+{\left\Vert \boldsymbol{L}\right\Vert}_F^2\right) $$
(14)

where Cij measures our confidence in observing Mij, which can be calculated by Eq. 15. Pij = 1, if Mij > 0; otherwise Pij = 0.

$$ {\boldsymbol{C}}_{ij}=1+\gamma \times {\boldsymbol{M}}_{ij} $$
(15)

where γ is the ratio to balance the zero and non-zero elements in M.
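The binary preference matrix P and the confidence matrix C of Eq. 15 follow directly from M, as the sketch below shows for a toy matrix stored as nested lists. The value γ = 10 is arbitrary and used only for illustration.

```python
def preference_and_confidence(M, gamma):
    """Return (P, C) with P_ij = 1 if M_ij > 0 else 0 and C_ij = 1 + gamma * M_ij."""
    P = [[1 if m > 0 else 0 for m in row] for row in M]
    C = [[1 + gamma * m for m in row] for row in M]
    return P, C

M = [[0, 2],
     [3, 0]]
P, C = preference_and_confidence(M, gamma=10)
```

Unvisited pairs keep the minimal confidence 1, so they still contribute to the objective but far less than frequently visited pairs.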

In order to leverage the profiles of users and travel locations, we introduce user-user and travel location-travel location similarities to constrain the matrix factorization process, which is inspired by some former works [19, 27]. The objective function is transformed correspondingly, which is given by Eq. 16.

$$ V=\underset{\boldsymbol{U},\boldsymbol{L}}{\min}\frac{1}{2}\sum \limits_{0<i\le \left|U\right|,0<j\le \left|L\right|}{\boldsymbol{C}}_{ij}{\left({\boldsymbol{P}}_{ij}-{\boldsymbol{U}}_i^{\mathrm{T}}{\boldsymbol{L}}_j\right)}^2+\frac{\lambda }{2}\left({\left\Vert \boldsymbol{U}\right\Vert}_F^2+{\left\Vert \boldsymbol{L}\right\Vert}_F^2\right)+\frac{\beta }{2}\left(\sum \limits_{0<i,j\le \left|U\right|}\mathrm{SimU}\left({u}_i,{u}_j\right){\left\Vert {\boldsymbol{U}}_i-{\boldsymbol{U}}_j\right\Vert}_F^2+\sum \limits_{0<i,j\le \left|L\right|}\mathrm{SimL}\left({l}_i,{l}_j\right){\left\Vert {\boldsymbol{L}}_i-{\boldsymbol{L}}_j\right\Vert}_F^2\right) $$
(16)

where β is a parameter to control the significance of user-user and travel location-travel location similarities.

We update U and L alternately. Specifically, we compute the gradient of Eq. 16 with respect to U when fixing L to update U, which is given by Eq. 17. Similarly, we update L by Eq. 18:

$$ \boldsymbol{U}\leftarrow \boldsymbol{U}+\alpha \left(\boldsymbol{C}\ast \left(\boldsymbol{P}-{\boldsymbol{U}}^{\mathrm{T}}\boldsymbol{L}\right){\boldsymbol{L}}^{\mathrm{T}}-\beta \sum \limits_{0<i,j\le \left|U\right|}\mathrm{SimU}\left({u}_i,{u}_j\right)\left({\boldsymbol{U}}_i-{\boldsymbol{U}}_j\right)-\lambda \boldsymbol{U}\right) $$
(17)
$$ \boldsymbol{L}\leftarrow \boldsymbol{L}+\alpha {\boldsymbol{U}}^{\mathrm{T}}\left(\boldsymbol{C}\ast \left(\boldsymbol{P}-{\boldsymbol{U}}^{\mathrm{T}}\boldsymbol{L}\right)-\beta \sum \limits_{0<i,j\le \left|L\right|}\mathrm{SimL}\left({l}_i,{l}_j\right)\left({\boldsymbol{L}}_i-{\boldsymbol{L}}_j\right)-\lambda \boldsymbol{L}\right) $$
(18)

where α is the learning rate, and ∗ denotes the element-wise (Hadamard) product.
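The alternating update can be sketched per factor vector, which keeps the shapes explicit: U[i] and L[j] below are the d-dimensional latent vectors of user ui and travel location lj. This is a plain-Python illustration under stated assumptions, not the implementation used in the experiments. In particular, the similarity terms of Eq. 16 are differentiated exactly, which yields the coefficient β(Sim(i, j) + Sim(j, i)); for a symmetric similarity this is 2β, and the constant factor can be absorbed into β. All function names and toy values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def objective(U, L, P, C, sim_u, sim_l, lam, beta):
    """Value of Eq. 16 for factor vectors U[i] and L[j]."""
    fit = sum(C[i][j] * (P[i][j] - dot(U[i], L[j])) ** 2
              for i in range(len(U)) for j in range(len(L)))
    reg = sum(dot(u, u) for u in U) + sum(dot(l, l) for l in L)

    def smooth(F, sim):
        return sum(sim[a][b] * sum((x - y) ** 2 for x, y in zip(F[a], F[b]))
                   for a in range(len(F)) for b in range(len(F)))

    return (0.5 * fit + 0.5 * lam * reg
            + 0.5 * beta * (smooth(U, sim_u) + smooth(L, sim_l)))

def update(F, G, err, sim, alpha, lam, beta):
    """One gradient step on the vectors in F with G fixed.

    err[a][b] is the confidence-weighted residual pairing F[a] with G[b].
    """
    new_F = []
    for a, f in enumerate(F):
        grad = [lam * x for x in f]                 # regularisation term
        for b, g in enumerate(G):                   # weighted fitting term
            for d in range(len(f)):
                grad[d] -= err[a][b] * g[d]
        for b, other in enumerate(F):               # similarity constraint
            coeff = beta * (sim[a][b] + sim[b][a])
            for d in range(len(f)):
                grad[d] += coeff * (f[d] - other[d])
        new_F.append([x - alpha * gd for x, gd in zip(f, grad)])
    return new_F

def alternating_step(U, L, P, C, sim_u, sim_l, alpha, lam, beta):
    err = [[C[i][j] * (P[i][j] - dot(U[i], L[j])) for j in range(len(L))]
           for i in range(len(U))]
    U = update(U, L, err, sim_u, alpha, lam, beta)          # L fixed
    err_t = [[C[i][j] * (P[i][j] - dot(U[i], L[j])) for i in range(len(U))]
             for j in range(len(L))]
    L = update(L, U, err_t, sim_l, alpha, lam, beta)        # new U fixed
    return U, L

# Toy problem: 2 users, 3 travel locations, d = 2.
U0 = [[0.1, 0.2], [0.3, 0.1]]
L0 = [[0.2, 0.1], [0.1, 0.3], [0.2, 0.2]]
P = [[1, 0, 1], [0, 1, 0]]
C = [[3, 1, 2], [1, 2, 1]]
sim_u = [[1.0, 0.5], [0.5, 1.0]]
sim_l = [[1.0, 0.2, 0.4], [0.2, 1.0, 0.3], [0.4, 0.3, 1.0]]
before = objective(U0, L0, P, C, sim_u, sim_l, lam=0.001, beta=0.0003)
U1, L1 = alternating_step(U0, L0, P, C, sim_u, sim_l,
                          alpha=0.05, lam=0.001, beta=0.0003)
after = objective(U1, L1, P, C, sim_u, sim_l, lam=0.001, beta=0.0003)
```

Repeating `alternating_step` until the objective stops decreasing gives the optimized factors. The small step size used here keeps each half-step a descent step on this toy problem.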

4.6 Travel location recommendation

After minimizing Eq. 16, we obtain the optimized U and L. By computing \( {\boldsymbol{U}}^{\mathrm{T}}\boldsymbol{L} \) (Eq. 12), we get the completed user-travel location matrix R, which fills in the missing values in M; Rij is the preference score of user ui for travel location lj.

Given target user u and query city c where the user wants to travel (i.e., the query is q = (u, c)), we first obtain the user’s preference scores for all the travel locations in city c from the completed matrix R. Based on preference scores, we return top n travel locations as the results of the query.
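The query step can be sketched as follows. This is a minimal illustration under stated assumptions: U and L store one latent vector per column, and `city_locations` is a hypothetical list holding the column indices of the travel locations in the query city. The function name is not from the original work.

```python
import numpy as np

def recommend_top_n(U, L, user, city_locations, n):
    """Return the n locations of the query city with the highest scores."""
    city_locations = np.asarray(city_locations)
    # Row `user` of R = U^T L, restricted to the locations of the query city.
    scores = U[:, user] @ L[:, city_locations]
    # Sort by descending score; a stable sort keeps ties in input order.
    order = np.argsort(-scores, kind="stable")[:n]
    return city_locations[order].tolist()
```

Only the scores of the query city are computed, so the full matrix R never has to be materialized for a single query.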

5 Experiments

5.1 Dataset

We use the public API of Flickr to collect 736,383 geo-tagged photos that were taken in six cities in China between 1 January 2001 and 1 July 2011 [16]. We use the method introduced in Section 4.1 to find travel locations in these six cities. Table 1 shows the corresponding numbers of users and travel locations in each city. After using the method introduced in Section 4.2 to identify users’ visits, we get 4386 visits for 882 users at 1514 travel locations, which results in a user-travel location matrix with 99.67% sparsity.
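The reported sparsity can be reproduced with a short calculation, assuming that each of the 4386 visits corresponds to a distinct (user, travel location) entry of the matrix:

```python
users, locations, visits = 882, 1514, 4386

# Fraction of empty entries in the user-travel location matrix.
sparsity = 1 - visits / (users * locations)
print(f"{sparsity:.2%}")
```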

Table 1 The numbers of users and travel locations in each city

5.2 Experiment settings

In this section, we first give the settings of some important parameters.

  1) The parameters of P-DBSCAN: We set minPts = 50 photos, ε = 100 m, and density ratio ω = 0.5 for P-DBSCAN, following [16]. An example of the clustering results in Hangzhou is shown in Fig. 4.

  2) Visit duration threshold \( {t}_{\mathrm{thr}} \): We set \( {t}_{\mathrm{thr}} \) = 6 hours, following [16].

  3) The network structure of VAE: We construct the VAE with Keras, and the network structure is shown in Table 2.

  4) The parameters in matrix factorization: We vary the values of β, λ, and α over [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3], and find that the optimal values are β = 0.0003, λ = 0.001, and α = 0.3. We tune parameters γ and d with experiments in Section 5.3.

  5) Time threshold ∆T in visit sequence: We tune the parameter with experiments in Section 5.4.

  6) Topic number k in topic model: We tune the parameter with experiments in Section 5.5.

  7) Similarity weights: We vary the values of similarity weights over [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], and find that the optimal values are w1 = 0.5, w2 = 0.1, w3 = 0.4, w4 = 0.2, and w5 = 0.3.
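The search over β, λ, and α can be sketched as an exhaustive grid search. The source does not state whether the three parameters were tuned jointly or one at a time, so the joint search below is an assumption, and `validation_score` is a hypothetical callable that trains a model with the given parameters and returns its MAP on the validation set.

```python
from itertools import product

GRID = [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.3]

def grid_search(validation_score, grid=GRID):
    """Try every (beta, lam, alpha) combination and keep the best one."""
    best_params, best_score = None, float("-inf")
    for beta, lam, alpha in product(grid, repeat=3):
        score = validation_score(beta, lam, alpha)
        if score > best_score:
            best_params, best_score = (beta, lam, alpha), score
    return best_params, best_score
```

With eight candidate values per parameter, the joint search evaluates 8³ = 512 configurations.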

Fig. 4 The clustering results in Hangzhou

Table 2 The network structure of VAE

In the following experiments, we select users who have visited at least three cities. For each user, we sort the visited cities according to visit time. The data of the last visited city is chosen for testing, while the data of the second to last visited city is chosen to tune parameters, and the rest are used for training. According to the above dividing strategy, we split the dataset \( \mathcal{D} \) into training set \( {\mathcal{D}}_{\mathrm{train}} \), validation set \( {\mathcal{D}}_{\mathrm{valid}} \), and test set \( {\mathcal{D}}_{\mathrm{test}} \).
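The dividing strategy can be sketched as follows. The sketch assumes that visits are available as (user, city, timestamp, travel location) tuples and that cities are ordered by the time of the user's earliest visit in each city; the source does not specify how the visit time of a city is defined, so this ordering rule is an assumption, as are the function and variable names.

```python
from collections import defaultdict

def split_by_city(visits, min_cities=3):
    """Split visits into train/valid/test by each user's city order."""
    per_user = defaultdict(lambda: defaultdict(list))
    for user, city, ts, loc in visits:
        per_user[user][city].append((ts, loc))

    train, valid, test = [], [], []
    for user, cities in per_user.items():
        if len(cities) < min_cities:
            continue  # keep only users who have visited enough cities
        # Order cities by the earliest visit time in each city (assumption).
        ordered = sorted(cities, key=lambda c: min(ts for ts, _ in cities[c]))
        for rank, city in enumerate(ordered):
            if rank == len(ordered) - 1:
                bucket = test      # last visited city
            elif rank == len(ordered) - 2:
                bucket = valid     # second to last visited city
            else:
                bucket = train     # all earlier cities
            bucket.extend((user, city, ts, loc) for ts, loc in cities[city])
    return train, valid, test
```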

To evaluate recommendation performance, we employ MAP@n as the performance metric, which is calculated by Eqs. 19 and 20:

$$ \mathrm{MAP}@n=\left({\sum}_{i=1}^m\mathrm{AP}@n\right)/m $$
(19)
$$ \mathrm{AP}@n=\left({\sum}_{i=1}^n\left({\sum}_{j=1}^i{f}_j\right)/i\right)/n $$
(20)

where n indicates the number of travel locations to be recommended, and m represents the number of target users. \( {f}_j \) is 1 if the user has actually visited the j-th location in the returned results; otherwise \( {f}_j \) is 0.
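A literal transcription of Eqs. 19 and 20 is sketched below. Note that Eq. 20 averages the precision at every rank from 1 to n, not only at the ranks where a visited location appears. The function names and the representation of the ground truth as a set are illustrative choices.

```python
def ap_at_n(recommended, visited, n):
    """AP@n as in Eq. 20: mean of precision@i over i = 1..n."""
    hits = 0
    total = 0.0
    for i, location in enumerate(recommended[:n], start=1):
        hits += location in visited  # f_i is 1 for a visited location
        total += hits / i            # precision at rank i
    return total / n

def map_at_n(results, n):
    """MAP@n as in Eq. 19: mean AP@n over all target users.

    `results` is a list of (recommended list, set of visited locations).
    """
    return sum(ap_at_n(rec, vis, n) for rec, vis in results) / len(results)
```

For example, for the returned list `["a", "b", "c"]` with visited locations `{"a", "c"}` and n = 3, the precisions at ranks 1 to 3 are 1, 1/2, and 2/3, so AP@3 = (1 + 1/2 + 2/3)/3.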

5.3 The effects of ratio γ and latent variable number d in matrix factorization

Ratio γ controls the rate at which confidence increases. If γ is too small, the confidence levels of the same user's different visit frequencies cannot be well distinguished; if γ is too large, the confidence levels may be extremely biased, since a visit to a travel location can also result from factors other than preference [52]. The latent variable number d also affects the performance of the proposed method. If d is too small, it is difficult to distinguish clearly between users and travel locations; if d is too large, the computational complexity increases dramatically. We use \( {\mathcal{D}}_{\mathrm{train}} \) to build models and evaluate on \( {\mathcal{D}}_{\mathrm{valid}} \) to study the effects of ratio γ and latent variable number d. Specifically, we increase one parameter from 5 to 50 with a step size of 5 while fixing the other parameter to 5. The results are shown in Figs. 5 and 6: MAP@5 and MAP@10 are highest when γ is 15 and d is 25. Therefore, we set γ = 15 and d = 25 in the following experiments.

Fig. 5 The MAP under different values of γ

Fig. 6 The MAP under different values of d

5.4 The effects of time threshold ∆T in visit sequence

Time threshold ∆T determines the length and number of visit sequences. When ∆T is small, the extracted visit sequences are generally too short to yield useful sequential representations; when ∆T is large, more noise is introduced. We use \( {\mathcal{D}}_{\mathrm{train}} \) to build models and evaluate on \( {\mathcal{D}}_{\mathrm{valid}} \) to study the effects of time threshold ∆T. The results are shown in Table 3: MAP@5 and MAP@10 are highest when ∆T is 12. Therefore, ∆T is set to 12 in the following experiments.
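The role of ∆T can be illustrated with a small sketch. It assumes that a user's visits are cut into a new sequence whenever the gap between two consecutive visits exceeds ∆T; this rule is an assumption consistent with the description above and is not a restatement of the exact extraction procedure. Timestamps and ∆T are taken to be in the same unit.

```python
def split_sequences(visits, delta_t):
    """Group one user's (timestamp, location) visits into visit sequences."""
    sequences = []
    for ts, loc in sorted(visits):
        # Continue the current sequence if the gap to the previous visit
        # is at most delta_t; otherwise start a new sequence.
        if sequences and ts - sequences[-1][-1][0] <= delta_t:
            sequences[-1].append((ts, loc))
        else:
            sequences.append([(ts, loc)])
    return [[loc for _, loc in seq] for seq in sequences]
```

With a small ∆T the visits fall into many short sequences, while a large ∆T merges visits that may be unrelated into one long sequence, which matches the trade-off described above.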

Table 3 The MAP under different values of ∆T (mean ± standard deviation)

5.5 The effects of topic number k in topic model

Topic number k determines the expressiveness of a topic model. When k is small, the extracted topics are not expressive enough to represent the documents; when k is large, more noise is introduced. We use \( {\mathcal{D}}_{\mathrm{train}} \) to build models and evaluate on \( {\mathcal{D}}_{\mathrm{valid}} \) to study the effects of topic number k. The results are shown in Fig. 7: MAP@5 and MAP@10 are highest when k is 20. Therefore, k is set to 20 in the following experiments.

Fig. 7 The MAP under different values of k

5.6 The effects of different additional information

To study the effects of different additional information, we remove, one at a time, the visual, sequential, and textual similarities of users and of travel locations, as well as the co-visit probabilities, forming seven variants: WIND-MF-uvis, WIND-MF-useq, WIND-MF-utex, WIND-MF-lvis, WIND-MF-lseq, WIND-MF-ltex, and WIND-MF-dis.
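The construction of a variant can be sketched as a weighted sum of similarity matrices with one component switched off. The component names, the weight values in the usage example, and the choice to set the removed weight to zero without renormalizing the others are all assumptions made for illustration.

```python
import numpy as np

def weighted_similarity(components, weights, drop=None):
    """Weighted sum of similarity matrices, optionally dropping one.

    `components` maps a name (e.g. "vis") to a similarity matrix and
    `weights` maps the same names to their weights (hypothetical names).
    """
    total = None
    for name, matrix in components.items():
        w = 0.0 if name == drop else weights[name]
        term = w * np.asarray(matrix, dtype=float)
        total = term if total is None else total + term
    return total
```

For instance, `weighted_similarity(user_sims, user_weights, drop="vis")` would produce the user-user similarity for a variant without user visual similarity, where `user_sims` and `user_weights` are hypothetical dictionaries.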

We use \( {\mathcal{D}}_{\mathrm{train}} \) to build models and evaluate on \( {\mathcal{D}}_{\mathrm{test}} \). The parameters of the compared methods are well tuned to ensure fairness. The results are shown in Table 4, from which we can find:

  1) Different types of additional information improve recommendation performance to different degrees. Ranked by influence, the order is: location visual effect > user visual effect > location sequential effect > user sequential effect > user textual effect > geographical effect > location textual effect.

  2) For both users and travel locations, visual information plays the most important role, which indicates that users like to visit travel locations with similar visual appearances [21]. Sequential information brings the second largest improvement in recommendation performance, as human movement usually exhibits strong sequential patterns [32].

  3) Textual information is more important for users than for travel locations, possibly because the textual tags of geo-tagged photos are generated by users to express their feelings rather than to describe the travel locations.

  4) Geographical information brings only a minor improvement in recommendation performance, possibly because most travel locations in a city are not far enough apart (76% of the pairs of travel locations in a city are less than 5 km apart) for users to take geographical distance into account when making trade-off decisions.

  5) The performance of the proposed method is significantly higher than that of the seven variants, showing that considering visual, sequential, textual, and geographical information simultaneously improves recommendation performance significantly.

Table 4 The effects of different additional information (mean ± standard deviation), * indicates WIND-MF is statistically superior to the compared method (pairwise t-test at a significance level of 5%)

5.7 The effects of the number of visited cities of the target user

In this section, we study the recommendation performance of the proposed method under different numbers of visited cities of the target user. The results are shown in Fig. 8, from which we can find that the more cities the target user has visited, the higher recommendation performance the proposed method can offer.

Fig. 8 The MAP under different numbers of visited cities of the target user

5.8 The effects of the number of visited users in the query city

In this section, we study the recommendation performance of the proposed method under different numbers of visited users in the query city. The results are shown in Fig. 9, from which we can find that the more users have visited the query city, the higher recommendation performance the proposed method can offer.

Fig. 9 The MAP under different numbers of visited users in the query city

5.9 The comparison of different methods

In this section, the proposed method is compared with some state-of-the-art location recommendation methods. The compared methods are as follows:

  1) Regularized Matrix Factorization based method (RMF) [51] is a baseline method based on matrix factorization that does not consider any additional information.

  2) Dynamic Topic Model and Matrix Factorization based method (DTMMF) [27] first profiles users and travel locations in the same way as WIND-MF, and then concatenates visual, sequential, and textual representations to represent users and travel locations. User-user and travel location-travel location similarities are calculated on these concatenated representations to constrain the factorization of the user-travel location matrix.

  3) Multi-Context Trajectory Embedding Model (MC-TEM) [25] embeds user-level, visual-level, topic-level, sequence-level, and location-level contexts into a shared low-dimensional space, and recommends travel locations that are close to the target user.

  4) Author Topic model-based Collaborative Filtering method (ATCF) [17] first profiles users in the same way as WIND-MF, and then concatenates visual, sequential, and textual representations to represent users. User-user similarities are calculated based on the concatenated representations, and then user-based CF is exploited for recommendation.

  5) Preference And Context Embedding (PACE) [24] leverages a deep neural network to model non-linear complex feature interactions between users and travel locations, while exploiting user-visual words, user-topic, location-location (co-visiting), location-location (geographical proximity), location-visual words, and location-topic context graphs to ensure that users or locations sharing more similar contexts have closer embeddings.

  6) RecNet [39] first factorizes co-visiting, geographical proximity, topic correlation, and visual correlation matrices to obtain the embeddings of travel locations, and then embeds users according to their visited locations. A deep neural network is finally leveraged to learn high-order feature interactions.

We use \( {\mathcal{D}}_{\mathrm{train}} \) to build models and evaluate on \( {\mathcal{D}}_{\mathrm{test}} \). The parameters of the compared methods are well tuned to ensure fairness. The results are shown in Table 5, from which we can find:

  1) The recommendation performance of RMF is weaker than that of all the other methods, possibly because RMF infers users’ preferences only by factorizing the sparse user-travel location matrix, without considering any additional information.

  2) The recommendation performance of MC-TEM is better than that of ATCF, possibly because MC-TEM leverages additional information to embed both users and travel locations, while ATCF only leverages additional information to profile users.

  3) The recommendation performance of PACE and RecNet is better than that of MC-TEM, possibly because deep neural network based methods are able to learn high-order feature interactions. In addition, MC-TEM considers different additional information together by using a united conditional probability function, which may aggravate the sparsity problem.

  4) The recommendation performance of RecNet is better than that of PACE, possibly because RecNet embeds users according to their visited locations, which reduces the number of parameters that need to be trained.

  5) The recommendation performance of DTMMF is better than that of RecNet. A possible reason is that DTMMF uses features extracted from additional information as priors to factorize the user-travel location matrix. It therefore has fewer parameters than RecNet, which uses a deep neural network to learn high-order feature interactions, and can achieve better performance given a sparse user-travel location matrix.

  6) The recommendation performance of WIND-MF is better than that of DTMMF, possibly because WIND-MF captures the significance of different additional information by assigning different weights, while DTMMF assigns the same weight to all additional information.

Table 5 The comparison of different location recommendation methods (mean ± standard deviation), * indicates WIND-MF is statistically superior to the compared method (pairwise t-test at a significance level of 5%)

6 Conclusions and future work

In this paper, we propose WIND-MF for personalized travel location recommendation based on geo-tagged photos. The method profiles users and travel locations based on the visual, sequential, and textual information of geo-tagged photos, and models co-visit probabilities based on geographical distances. Visual, sequential, and textual similarities as well as co-visit probabilities are assigned different weights to obtain weighted user-user and travel location-travel location similarities, which are then used as regularization terms to constrain the factorization of the user-travel location matrix. The experimental results show the superiority of the proposed method. We also find that visual and sequential information contribute most to improving recommendation performance.

The proposed method leaves room for further extension. We intend to introduce users’ preferences to guide visual feature extraction from photos. In addition, more context information (e.g., time, weather, and traffic) can be introduced to make the proposed method context-aware.