1 Introduction

Imagine that you are visiting a new country and traveling among different cities. In each city you encounter countless places: you may see a fancy building, experience wild natural beauty, or enjoy a unique culture. All of these experiences impress you and lead you to a deeper understanding of the place. As we browse through a city, based on certain common visual elements therein, we implicitly establish connections between visual characteristics and other multi-faceted information, such as its function, socioeconomic status, and culture. We therefore believe that it would be a rewarding adventure to move beyond conventional place categorization and explore the connections among different aspects of a place. This suggests that multi-dimensional labels are essential for comprehensive place understanding. To support this exploration, a large-scale dataset that covers a diverse set of places with both images and comprehensive multi-faceted information is needed.

Fig. 1.
figure 1

Hierarchical structure of Placepedia with places from all over the world. Each place is associated with its district, city/town/village, state/province, country, continent, and a large number of diverse photos. Both administrative areas and places carry rich side information, e.g. description, population, category, and function, which allows various large-scale studies to be conducted on top of it

However, existing datasets for place understanding [40, 43, 69], as shown in Table 1, are subject to at least one of the following drawbacks: 1) Limited Scale. Some of them [42, 43] contain only several thousand images from one particular city. 2) Restrictive Scope. Most datasets are constructed for only one task, e.g. place retrieval [40] or scene recognition [24, 69]. 3) Lack of Attributes. These datasets often contain only a very limited set of attributes; for example, [40] provides just photographers and titles. Clearly, these datasets, due to their limitations in scale, diversity, and richness, are not able to support the development of comprehensive place understanding.

In this work, we develop Placepedia, a comprehensive place dataset that contains images of places of interest from all over the world together with massive attributes, as shown in Fig. 1. Placepedia is distinguished in several aspects: 1) Large Scale. It contains over 35M images from 240K places, several times larger than previous datasets. 2) Comprehensive Annotations. The places in Placepedia are tagged with categories, functions, administrative divisions at different levels, e.g. city and country, as well as abundant multi-faceted side information, e.g. descriptions and coordinates. 3) Public Availability. Placepedia will be made public to the research community. We believe that it will greatly benefit research on comprehensive place understanding and beyond.

Table 1. Comparing Placepedia with other existing datasets. Placepedia offers the largest number of images and the richest information

Meanwhile, Placepedia also enables us to rigorously benchmark the performance of existing and future algorithms for place recognition. We create four benchmarks based on Placepedia in this paper, namely place retrieval, place categorization, function categorization, and city/country recognition. By comparing different methods and modeling choices on these benchmarks, we gain insights into their pros and cons, which we believe will inspire more effective techniques for place recognition. Furthermore, to provide a trigger for comprehensive place understanding, we develop PlaceNet, a unified deep network for multi-level place recognition. It simultaneously predicts place item, category, function, city, and country. Experiments show that by leveraging the multi-level annotations in Placepedia, PlaceNet learns better representations of a place than previous works. We also leverage both visual and side information from Placepedia to learn city embeddings, which demonstrate strong expressive power and offer insights into what distinguishes a city.

From the empirical studies on Placepedia, we see many challenges in performing place recognition. 1) The visual appearance can vary significantly due to changes in viewing angle, illumination, and other environmental factors. 2) A place may look completely different when viewed from inside and outside. 3) A big place, e.g. a university, usually consists of a number of small places that have nothing in common in appearance. All these problems remain open. We hope that Placepedia, with its large scale, high diversity, and massive annotations, will provide a gold mine for the community to develop more expressive models that meet the aforementioned challenges.

Our contributions in this work can be summarized as below. 1) We build Placepedia, a large-scale place dataset with comprehensive annotations in multiple aspects. To the best of our knowledge, Placepedia is the largest and the most comprehensive dataset for place understanding. 2) We design four task-specific benchmarks on Placepedia w.r.t. the multi-faceted information. 3) We conduct systematic studies on place recognition and city embedding, which demonstrate important challenges in place understanding as well as the connections between the visual characteristics and the underlying socioeconomic or cultural implications.

2 Related Work

Place Understanding Datasets. Datasets play an important role for various research topics in computer vision [4, 19,20,21,22,23, 28, 31, 32, 34, 44, 45, 63, 65,66,67,68]. During the past decade, many place datasets were constructed to facilitate place-related studies. There are mainly three kinds of datasets. The first kind [6, 12, 40, 58, 59] focuses on the tasks of place recognition/retrieval, where images are labeled as particular place items, e.g. White House or Eiffel Tower. The second kind [69] targets place categorization or scene recognition. In these datasets, each image is attached with a place type, e.g. parks or museums. The third kind [24, 42, 43] is for object/image retrieval. The statistics are summarized in Table 1. Compared with these datasets, our Placepedia has a much larger amount of image and context data, containing over 240K places with 35 million images labeled with 3K categories. Hence, Placepedia can be used for all these tasks. Besides, the provision of hierarchical administrative divisions for places allows us to study place recognition at different scales, e.g. city recognition or country recognition. Also, the function information (See, Do, Sleep, Eat, Buy, etc.) of places may lead to a new task, namely place function recognition.

Place Understanding Tasks. A large body of work aims at 2D place recognition [1, 2, 5, 6, 12, 13, 17, 25, 27, 33, 37, 38, 41, 46, 48, 52, 54, 58, 59, 71] or place retrieval [10, 11, 25, 40, 47, 55, 57, 61]. Given an image, the goal is to recognize what the particular place is or to retrieve images representing the same place. Scene recognition [29, 60, 62, 64, 69, 70], on the other hand, defines a diverse list of environment types as labels, e.g. nature, classroom, bar, and assigns each image a scene type. [8, 50] collect images from several different cities, study how to distinguish images of one city from those of others, and discover which elements are characteristic of a given city. There also exist other humanities-related studies. [53] classifies keywords of city descriptions into Economic, Cultural, or Political, and then counts the occurrences of these three types to represent a city's brand. [35] uses satellite data from both daytime and nighttime to discover "ghost cities", which consist of abandoned buildings or housing structures that may hinder the urbanization process. [26] collects place images from social media to extract human emotions at different places, in order to find the relationship between human emotions and environmental factors. [39] uses neural networks trained with map imagery to understand the connection among cities, transportation, and human health. Placepedia, with its large amount of data in both visual and textual domains, allows various studies to be conducted on top of it at large scale.

Fig. 2.
figure 2

The text in red, orange, green, blue, and purple represents place names, cities, countries/territories, categories, and functions, respectively. We see that the appearance of a place can vary 1) from daytime to nighttime, 2) across different viewing angles, and 3) between inside and outside views (Color figure online)

Fig. 3.
figure 3

(a) The number of places for top-10 functions. (b) The number of places for top-50 categories

3 The Placepedia Dataset

We contribute Placepedia, a large-scale place dataset, to the community. Some example images along with labels are shown in Fig. 2. In this section, we introduce the procedure of building Placepedia. First, a hierarchical structure is organized to store places and their multi-level administrative divisions, where each place/division is associated with rich side information. With this structure, places around the globe are connected and classified at different levels, which allows us to investigate numerous place-related issues, e.g. city recognition [8, 50], country recognition, and city embedding. With types (e.g. park, airport) provided, we can explore tasks such as place categorization [29, 69]. With functions (e.g. See, Sleep) provided, we are able to model the functionality of places. Second, we download place images from Google Image, which are cleaned both automatically and manually.

3.1 Hierarchical Administrative Areas and Places

Place Collection. We collect place items with side information from Wikivoyage (Footnote 1), a free worldwide travel guide website, through its public channel. Pages of Wikivoyage are organized in a hierarchical way, i.e., each destination is reached by walking through its continent, country, state/province, city/town/village, district, etc. As illustrated in Fig. 1, these administrative areas serve as non-leaf nodes in Placepedia, and leaf nodes represent the places. This process results in a list of 361,524 places together with 24,333 administrative areas.
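For concreteness, the hierarchy can be thought of as a tree whose leaves are places. Below is a minimal Python sketch; the node fields are illustrative assumptions, not the actual Placepedia schema.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                                   # e.g. "Europe", "France", "Paris"
    level: str                                  # continent/country/state/city/district/place
    meta: dict = field(default_factory=dict)    # description, population, category, ...
    children: list = field(default_factory=list)

def leaf_places(node):
    """Yield all places (leaf nodes) under an administrative area."""
    if node.level == "place":
        yield node
    for child in node.children:
        yield from leaf_places(child)
```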

Fig. 4.
figure 4

This figure shows ten function labels with five example places for each

Meta Data Collection. In Wikivoyage, all destinations are associated with some of the attributes below: function, category, description, GPS coordinates, address, phone number, opening hours, price, homepage, and Wikipedia link. The numbers of places for the top-10 functions and top-50 categories are shown in Fig. 3. Table 2 gives the definitions of the ten functions in Placepedia, and Fig. 4 shows several examples of each. Function labels are the section names of places from Wikivoyage. Place functions serve as a good indicator for travelers to choose where to go. For example, some people love to go shopping when traveling; some prefer to enjoy various flavors of food; others are drawn to distinctive landscapes. For administrative areas, Wikivoyage often lacks metadata. Hence, we acquire the missing information by parsing their Wikipedia pages. Finally, the following attributes are extracted: description, GDP, population density, population, elevation, time zone, area, GPS coordinates, establishment time, etc.

Place Cleaning. To refine the place list, we only keep places satisfying at least one of the following two criteria: 1) it has the attribute GPS coordinates or address; 2) it is identified as a location by Google Entity Recognition (Footnote 2) or Stanford Entity Recognition (Footnote 3). After the removal, 44,997 items are deleted and 316,527 valid place entities remain.
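A sketch of this filtering rule is given below, with spaCy's NER standing in for the Google/Stanford recognizers used in the actual pipeline; `all_places` is assumed to be a list of the hypothetical Node objects sketched above.

```python
import spacy

nlp = spacy.load("en_core_web_sm")
LOCATION_LABELS = {"GPE", "LOC", "FAC"}   # geo-political entities, locations, facilities

def keep_place(place):
    # Criterion 1: the place has GPS coordinates or an address.
    if place.meta.get("coordinates") or place.meta.get("address"):
        return True
    # Criterion 2: the name is recognized as a location by an NER system.
    return any(ent.label_ in LOCATION_LABELS for ent in nlp(place.name).ents)

valid_places = [p for p in all_places if keep_place(p)]
```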

3.2 Place Images

Image Collection. We collect all place images from the Google Image search engine in the public domain. For each location, its name plus its country is used as the search keyword. To increase the probability that images are relevant to a particular location, we only download those whose image titles, after stemming, contain all stem words of the location name. By this process, a total of over 30M images are collected from Google Image.
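A minimal sketch of this title filter follows; the choice of NLTK's Porter stemmer is an assumption, as the paper does not name a specific stemmer.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stems(text):
    return {stemmer.stem(w) for w in text.lower().split()}

def title_matches(place_name, image_title):
    # Keep the image only if the title's stems cover all stems of the name.
    return stems(place_name) <= stems(image_title)

assert title_matches("Eiffel Tower", "the eiffel tower at night")
assert not title_matches("Eiffel Tower", "a tall tower in paris")
```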

Image Cleaning. There are 28,154 places containing Wikipedia links, with 8,125,108 images. We use this subset to further study place-related tasks. The image set is refined in two stages. First, we use image hashing to remove duplicate images. Second, we ask human annotators to remove irrelevant images for each place, including those: 1) whose main body represents another place; 2) that are selfies with faces occupying a large proportion of the frame; 3) that are maps indicating the geolocation of the place. In total, 26K places with 5M images are kept to form this subset. For places without category labels, we manually annotate the labels. After merging similar labels, we obtain 50 categories.
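Below is a minimal sketch of the deduplication stage using perceptual hashing; the paper only says "Image Hashing", so the use of `phash` and the distance threshold are assumptions.

```python
from PIL import Image
import imagehash

def deduplicate(image_paths, max_distance=4):
    kept, seen = [], []
    for path in image_paths:
        h = imagehash.phash(Image.open(path))
        # "h - h2" is the Hamming distance between the two hashes.
        # (Quadratic scan for clarity; a real pipeline would bucket hashes.)
        if all(h - h2 > max_distance for h2 in seen):
            kept.append(path)
            seen.append(h)
    return kept
```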

Placepedia also helps alleviate common problems in place understanding, such as label confusion and label noise. On one hand, all labels are collected automatically from the Wikivoyage website; since it is a popular website providing worldwide travel guidance and its labels are well organized, there is less label confusion in the Placepedia dataset. On the other hand, we have manually checked the labels in Placepedia, which significantly reduces label noise.

Table 2. The description and examples for the 10 function labels of Places-Fine and Places-Coarse, which are collected from Wikivoyage

From the examples shown in Fig. 2, we observe that: 1) the appearance of a place may change considerably from daytime to nighttime or across seasons; 2) it can differ significantly when viewed from different angles; 3) the appearances from inside and outside usually have little in common; 4) some places, such as universities, span very large areas and consist of different types of small places. These factors make place-related tasks very challenging. In the rest of this paper, we conduct a series of experiments to demonstrate important challenges in place understanding as well as the strong expressive power of city embeddings.

4 Study on Comprehensive Place Understanding

This section introduces our exploration of comprehensive place understanding. First, we carefully design benchmarks and evaluate the dataset with a number of state-of-the-art models using different backbones for different tasks. Second, we develop a multi-task model, PlaceNet, which is trained to simultaneously predict place items, categories, functions, cities, and countries. This unified framework for place recognition can serve as a reasonable baseline for further studies on our Placepedia dataset. The experimental results also demonstrate the challenges of place recognition from multiple aspects.

4.1 Benchmarks

We build the following benchmarks based on the well-cleaned Placepedia subset, for evaluating different methods.

Datasets

  • Places-Coarse. We select 200 places for validation and 400 places for testing, from 50 famous cities in 34 countries. The remaining 25K places are used for training. For the validation/testing sets, we double-checked the annotation results. Places without category labels are manually annotated. After merging similar labels, we obtain 50 categories and 10 functions. The training/validation/testing sets contain 5M/60K/120K images respectively, from 7K cities in more than 200 countries.

  • Places-Fine. Places-Fine shares the same validation/testing sets with Places-Coarse. For the training set, we selected 400 places from the same 50 cities as the validation/testing places. Different from Places-Coarse, we also double-checked the annotations of the training data. The training/validation/testing sets contain 110K/60K/120K images respectively, which are tagged with 50 categories, 10 functions, 50 cities, and 34 countries.

Tasks

  • Place Retrieval. This task is to determine whether two images belong to the same place. It matters when people want to find more photos of places they adore. For the validation and testing sets, 20 images of each place are selected as queries and the remaining images form the gallery. Top-k retrieval accuracy is adopted to measure performance, such that a retrieval is counted as successful if at least one image of the same place is found among the top-k retrieved results (a sketch of the metrics follows this list).

  • Place Categorization. This task is to classify places into 50 place categories, e.g. museums, parks, churches, temples. For place categorization, we employ the standard top-k classification accuracy as the evaluation metric.

  • Function Categorization. This task is to classify places into 10 place functions: See, Do, Sleep, Eat, Drink, See & Do, Get In, Get Around, Buy, Learn. Again, we employ the standard top-k classification accuracy as the evaluation metric.

  • City/Country Recognition. This task is to classify places into 50 cities or 34 countries, i.e. to determine which city/country an image belongs to. Again, the standard top-k classification accuracy is applied as the evaluation metric.
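For reference, here is a minimal sketch of both metrics, assuming precomputed L2-normalized feature matrices and NumPy arrays for the ID/label vectors.

```python
import numpy as np

def topk_retrieval_acc(q_feats, q_ids, g_feats, g_ids, k=1):
    """A query counts as successful if any of its k nearest gallery images
    (cosine similarity on L2-normalized features) shares its place ID."""
    topk = np.argsort(-(q_feats @ g_feats.T), axis=1)[:, :k]
    return np.mean([(g_ids[row] == qid).any() for row, qid in zip(topk, q_ids)])

def topk_classification_acc(logits, labels, k=1):
    """Standard top-k accuracy: the true label is among the k highest scores."""
    topk = np.argsort(-logits, axis=1)[:, :k]
    return np.mean([label in row for row, label in zip(topk, labels)])
```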

Fig. 5.
figure 5

(a) Pipeline of PlaceNet, which learns five tasks simultaneously. (b) Pipeline of city embedding, which learns city representations considering both vision and text information

4.2 PlaceNet

We construct a CNN-based model to predict all tasks simultaneously. Training proceeds in an iterative manner, and the system is learned end-to-end.

Network Structures. The network structure of PlaceNet is similar to ResNet50 [16], which has proven powerful in various vision tasks. As illustrated in Fig. 5 (a), the layers of PlaceNet before the last convolution layer are the same as in ResNet50. The last convolution/pooling/fc layers are duplicated into five branches, namely place, category, function, city, and country, a design tailored for places. Each branch contains two FC layers. Different loss functions and pooling methods are studied in this work.
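A minimal PyTorch sketch of this five-branch design is shown below; the hidden dimension and other details are illustrative assumptions, not the exact PlaceNet configuration.

```python
import torch.nn as nn
import torchvision

class PlaceNet(nn.Module):
    def __init__(self, num_classes):   # e.g. {"place": 400, "category": 50, ...}
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)
        # Shared trunk: all layers up to (and excluding) the last residual stage.
        self.trunk = nn.Sequential(*list(resnet.children())[:-3])
        # Five task branches, each with its own last stage, pooling, and two FC layers.
        self.branches = nn.ModuleDict({
            task: nn.Sequential(
                torchvision.models.resnet50(pretrained=True).layer4,
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, n))
            for task, n in num_classes.items()})

    def forward(self, x):
        feat = self.trunk(x)
        return {task: branch(feat) for task, branch in self.branches.items()}

model = PlaceNet({"place": 400, "category": 50, "function": 10,
                  "city": 50, "country": 34})
```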

Loss Functions. We study three losses for PlaceNet, namely softmax loss, focal loss, and triplet loss. Softmax loss or focal loss [30] is adopted to classify place, category, function, city, and country. To learn the metric described by place pairs, we employ triplet loss [49], which enforces distance constraints among positive and negative samples. When using triplet loss, the network is optimized by a combination of \(L_{softmax}\) and \(L_{triplet}\).
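A sketch of the combined objective is given below; the margin value and the equal weighting of the two terms are assumptions.

```python
import torch.nn as nn

ce = nn.CrossEntropyLoss()               # the "softmax loss"
tri = nn.TripletMarginLoss(margin=0.3)   # margin is an assumed value

def placenet_loss(outputs, targets, anchor, positive, negative):
    # Classification loss summed over the five branches ...
    l_softmax = sum(ce(outputs[t], targets[t]) for t in outputs)
    # ... plus the metric-learning term on place embeddings.
    return l_softmax + tri(anchor, positive, negative)
```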

Pooling Methods. We also study different pooling methods for PlaceNet, namely average pooling, max pooling, and spatial pyramid pooling [15]. Spatial pyramid pooling (SPP) performs multi-scale pooling, which is robust to object deformations and provides an augmentation effect that counters overfitting.

Table 3. Experimental results of different methods on all tasks. We vary the pooling methods and loss functions for PlaceNet. Except for the last row, models are trained on Places-Fine. Figures in bold/blue indicate the best/second-best performance, respectively
Fig. 6.
figure 6

The four tables show the performance on the four tasks; each presents the five most and least accurate classes. Below each table are four sets of examples, including two green/red dashed boxes representing sample classes with high/low accuracy. Inside each dashed box, the ground truth is shown at the top, followed by three images with their predicted labels. Green/red solid boxes around images denote correct/incorrect predictions (Color figure online)

4.3 Experimental Settings

Data. We use Places-Fine and Places-Coarse defined in Sect. 4.1 as our experimental datasets. Note that Places-Fine and Places-Coarse share the same validation and testing data, while the training set of the latter is much larger.

Backbone Methods. Deep Convolutional Neural Networks (CNNs) [14, 18, 28] have shown impressive power on classification and retrieval tasks. Here we choose four popular CNN architectures, AlexNet [28], GoogLeNet [56], VGG16 [51], and ResNet50 [16], and train them on Places-Fine to create backbone models.

Training Details. We train each model for 90 epochs. For all tasks and all methods, the initial learning rate is set to 0.5 and is multiplied by 0.1 at epochs 63 and 81. The weight decay is \(1e^{-4}\). For the optimizer, we use stochastic gradient descent with momentum 0.9. We also augment the data following the common practice on ImageNet, including random cropping and horizontal flipping. All images are resized to \(224\times 224\) and normalized with mean [123, 117, 109] and standard deviation [58, 56, 58]. Each model is pre-trained on ImageNet and then trained on our Placepedia dataset in an end-to-end manner.
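These settings translate into the following PyTorch sketch, where `model` is assumed to be the PlaceNet sketch above; the normalization constants are divided by 255 because `ToTensor` rescales pixels to [0, 1].

```python
import torch
from torchvision import transforms

optimizer = torch.optim.SGD(model.parameters(), lr=0.5,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[63, 81], gamma=0.1)   # x0.1 at epochs 63 and 81

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[123 / 255, 117 / 255, 109 / 255],
                         std=[58 / 255, 56 / 255, 58 / 255]),
])
```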

All experiments are conducted on Places-Fine. We also train PlaceNet on Places-Coarse to see whether a larger-scale dataset can further benefit recognition performance.

4.4 Analysis on Recognition Results

Quantitative evaluations of different methods on the four benchmarks are provided. Table 3 summarizes the performance of different methods on all tasks. We first analyze the results on Places-Fine for all benchmark tasks.

Place Retrieval. PlaceNet with focal loss achieves the best retrieval results when evaluated with top-1 accuracy. Some sample places with high/low accuracies are shown in Fig. 6 (a). We observe that: 1) places with distinctive architecture can be easily recognized, e.g. Banco de México and Temple of Hephaestus. 2) For some parks, e.g. Franklin Park, there is usually no clear evidence to distinguish them from other parks; the same holds for categories such as gardens and churches. 3) Big places like Fun Spot Orlando may contain several small places whose appearances have nothing in common, which makes them very difficult to recognize. Places like resorts, towns, parks, and universities suffer the same issue.

Place Categorization. The best result is yielded by PlaceNet plus SPP. Some sample categories with high/low accuracies are shown in Fig. 6 (b). We observe that: 1) zoos are the most distinctive. Intuitively, if animals are seen in a place, it is probably a zoo; however, photos taken in zoos may be mistaken for parks. 2) Tombs can be confused with pubs due to poor illumination conditions.

Function Categorization. The best setting for learning the function of a place is to use ResNet models. Some sample functions with high/low accuracies are shown in Fig. 6 (c). 1) See is recognized with the highest accuracy. 2) Some examples of Buy are very difficult to identify, e.g. the third image in Buy; even humans cannot tell what a street is mainly used for. Is it for shopping, eating, or just for transportation? The same logic applies to shops. Images of Eat are often categorized as shops for buying or drinking. One possible way to recognize the function of a shop is to extract and analyze its name, or to recognize and classify the food type therein. 3) Universities are also often misrecognized, due to their large areas with various buildings/scenes.

City/Country Recognition. From Fig. 6 (d), we observe that: 1) cities with a long history (e.g. Florence, Beijing, Cairo) are easier to distinguish from others, because they often preserve the oldest arts and architecture. 2) Travelers often remark that Taiwan and Japan look quite alike; the results do show that places in Taipei may be mistaken for Tokyo. 3) Although a place can be wrongly classified to another city, the prediction often belongs to the same country as the ground-truth city. For instance, Florence and Milan are both in Italy; Beijing and Shanghai are both in China. The results of country recognition, not presented here, demonstrate similar findings to city recognition.

To conclude, place-related tasks are often very challenging: 1) places in categories such as parks, gardens, and churches are easy to classify, yet it is difficult to distinguish one park/garden/church from another; 2) under poor environmental conditions, photos can be extremely difficult to categorize; 3) recognizing the function of a street or a shop is non-trivial, i.e. it is hard to determine whether it serves dining, drinking, or shopping; 4) cities with a long history, such as Beijing and Florence, are often recognized with high accuracy, while images of other cities are more likely to be misclassified as similar ones inside or outside their countries. We hope that Placepedia, with its well-defined benchmarks, can foster more effective studies and thus benefit place recognition and retrieval. The last row of Table 3 shows that training PlaceNet on a larger amount of data yields further performance gains of 7 to 16% on different tasks.

5 Study on Multi-faceted City Embedding

We embed each city into an expressive vector to understand places at the city level. We also study the connections between visual characteristics and the underlying economic or cultural implications.

5.1 City Embedding

City embedding represents a city as a vector whose entries indicate different aspects, e.g. the economic level, the cultural deposits, the political atmosphere, etc. In this study, cities are embedded from both visual and textual domains. 1) Visual representations of cities are obtained by extracting features from models supervised with city labels. 2) The leading paragraphs collected from Wikipedia are used as the description of each city. [7] provides a pre-trained language-understanding model that embeds text content into a numeric space; we use this model to extract the textual representations of all cities.

Network Structure. The model for city embedding is illustrated in Fig. 5 (b). The input is constructed by concatenating the visual and textual vectors. Two fully connected layers with ReLU activations are then applied to learn the city embedding representation. Finally, a classifier with cross-entropy loss supervises the learning procedure.

Representative Vectors. We train the network iteratively. The trained network is then used to extract embedding vectors for all images, and city embeddings are acquired by averaging the image embeddings city-wise.
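A minimal PyTorch sketch of this embedding head follows; all dimensions and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CityEmbedder(nn.Module):
    def __init__(self, vis_dim, txt_dim, emb_dim, num_cities):
        super().__init__()
        self.fc1 = nn.Linear(vis_dim + txt_dim, emb_dim)
        self.fc2 = nn.Linear(emb_dim, emb_dim)
        self.classifier = nn.Linear(emb_dim, num_cities)

    def embed(self, vis, txt):
        x = torch.cat([vis, txt], dim=-1)    # fuse visual and textual vectors
        return torch.relu(self.fc2(torch.relu(self.fc1(x))))

    def forward(self, vis, txt):             # supervised with cross-entropy
        return self.classifier(self.embed(vis, txt))

# After training, a city's vector is the mean over its image embeddings:
# city_vec = embedder.embed(vis_feats, txt_feats).mean(dim=0)
```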

5.2 Experimental Results

We analyze the city embedding results from two aspects. First, we compare the expressive power of embeddings using different information, namely vision, text, and vision & text, to see whether learning from both yields a better city representation. Second, we investigate the embeddings neuron-wise to explore which kinds of images best express the economic/cultural/political levels of cities.

Visual and Text Embedding. In Fig. 7, we show three embedding results using t-SNE [36]. 1) The left graph shows embeddings using only visual features. We observe that it tends to cluster cities that are visually similar. For example, Tokyo looks like Taipei; Beijing, Shanghai, and Shenzhen are all in China, and Seoul shares many similar architectures with them; Florence, Venice, Milan, and Rome are all Italian cities. 2) The middle graph shows embeddings using only textual features. Textual features usually express the functionality and geolocation of a city. For example, Tokyo and Oslo are both capitals; London and New York are both financial centers; yet neither pair is visually alike. Also, cities from the same continent are clustered. 3) The right graph shows embeddings learned from both visual and textual domains, which express resemblance both visually and functionally. For example, cities from East/West Asia are clustered together, and cities from the Commonwealth of Nations, like Sydney, Auckland, and London, are also close to each other in the graph. From the comparison of these graphs, we conclude that learning embeddings from both vision and text produces the most expressive city representations.

Fig. 7.
figure 7

These three figures show t-SNE representation for city embeddings using vision, text, and vision & text info, respectively. Points with the same color belong to the same continent. We can see that learning from both generates the best embedding results

Fig. 8.
figure 8

Three graphs rank Pearson correlations of neurons with respect to economy, culture, and politics. Below each graph are the top-3 places activating the neuron with the largest correlation value

Economic, Cultural, or Political. We follow [53] to represent each city in three dimensions, namely economy, culture, and politics. In [53], word lists indicating economy, culture, and politics are predefined. In this work, the leading paragraphs of Wikipedia pages are adopted as our city descriptions, and for each city we calculate the economic, cultural, and political weights therefrom as in [53]. We then match each neuron to these weights using Pearson correlation [3], in order to quantify the connection between them. Quinnipiac University (Footnote 4) suggests that a correlation above 0.4 or below \(-0.4\) can be viewed as a strong correlation. From Fig. 8, we see that neurons express culture most confidently, with the highest correlation score larger than 0.6. This is consistent with common knowledge, i.e. culture is usually expressed through distinctive architecture or unique human activities. Looking at the top-3 places that activate the most relevant neuron, we observe that: 1) the economic level is usually conveyed by a cluster of buildings or crowds on streets, indicating a prosperous place; 2) the cultural atmosphere can be expressed by distinguished architectural styles and human activities; 3) political elements are often related to temples, churches, and historical sites, which usually indicate religious activities and politics-related historical movements.
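A sketch of this neuron-matching step is given below; the variable shapes are assumptions, with each embedding dimension correlated against the per-city aspect weights computed as in [53].

```python
import numpy as np
from scipy.stats import pearsonr

def rank_neurons(city_embeddings, aspect_weights):
    """city_embeddings: (num_cities, emb_dim); aspect_weights: (num_cities,),
    e.g. the per-city 'culture' weights derived from Wikipedia descriptions."""
    corrs = np.array([pearsonr(city_embeddings[:, j], aspect_weights)[0]
                      for j in range(city_embeddings.shape[1])])
    order = np.argsort(-np.abs(corrs))   # neurons ranked by |correlation|
    return order, corrs

# |r| > 0.4 is treated as a strong correlation, per the rule of thumb above.
```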

6 Conclusion

In this work, we construct a large-scale place dataset comprehensively annotated in multiple aspects. To our knowledge, it is the largest place-related dataset available. To explore place understanding, we carefully build several benchmarks and study contemporary models. The experimental results show that many challenges remain in place recognition. For city embedding, we demonstrate that learning from both visual and textual domains better characterizes a city. The learned embeddings also show that economic, cultural, and political elements are reflected in different types of images. We hope that, with the comprehensively annotated Placepedia contributed to the community, more powerful and robust systems will be developed to foster future place-related studies.