1 Introduction

Information and communication technologies (ICTs) enable new research questions that facilitate a better understanding of ourselves and the surrounding environment (Manovich 2011; Dalbello 2011). The last two decades have produced multiple tags for the vast quantities of data made available by ICTs, e.g., “big data”, the “data avalanche” (Miller 2010), and the “exaflood” (Swanson 2007).

Geographers have been dealing with some of the issues raised by big data (Barnes 2013), questioning the shifts it brings to theory and related practices (Floridi 2012; Boyd and Crawford 2012; Crampton and Krygier 2015). Yet a substantial and continued effort is still needed to understand its geographic relevance, as in the case of the connection between big data and geography (Graham and Shelton 2013).

Although big data challenges conventional concepts and practices of the “hard” sciences, in which Geographic Information Science is included (Goodchild 2013; Gorman 2013), its predominance will undoubtedly lead to a new quantitative turn in geography (Ruppert 2013). This is clearly a new paradigm shift in geographic research methodologies: a fourth, data-intensive, paradigm (Nielsen 2011).

Geographic technologies are now integrated into the social sciences, and they promote the value of geography to a wider audience. There is a growing list of applications of Geographic Information Systems (GIS) that expose their potential for handling the data deluge. Making sense of big data requires both computationally based analysis methods and the ability to situate the results (Berry 2012). Yet it also carries the risk of sidelining traditional interpretative approaches (Gold 2012). The big data era calls for new capacities of synthesis and synergies between qualitative and quantitative approaches (Sieber et al. 2011).

This paradoxical alliance between “poets and geeks” (Cohen 2010) can be a unique opportunity for geography, stimulating wider efforts to bridge the qualitative–quantitative chasm (Sui and DeLyser 2011) and enabling smart combinations of quantitative and qualitative methodologies (Bodenhamer et al. 2010; Daniels et al. 2011; Dear et al. 2011).

The emergence of critical geography, critical GIS, and radical approaches to quantitative geography fostered the idea that geographers are well prepared to combine quantitative methods with technical practice and critical analysis (Lave et al. 2014). This has proved not quite true, but big data now opens, especially through data mining, new possibilities for spatial analysis research (Michel et al. 2011) and can extend the limits of quantitative approaches to a wide array of problems usually addressed qualitatively (Lieberman-Aiden and Michel 2011; Michel et al. 2011).

The case is similar to the rebirth of social network theory and analysis: owing to the growing availability of relational datasets covering human interactions and relationships, researchers managed to implement a new set of theoretical techniques and concepts embracing network analysis (Barabási and Pósfai 2016).

Surveys exemplify this paradigm shift. This widely used data collection method faces a crisis of utility, caused by declining response rates, eroding sampling frames, and a narrow ability to record certain variables that are at the core of geographical analysis, e.g., accurate geographical location (Burrows and Savage 2014). Gradually, self-reported surveys quantifying human motivations and behaviours are being studied and compared with non-traditional data (Struijs et al. 2014; Daas et al. 2015).

Such limitations are even more pronounced considering that: (i) the majority of social survey data is cross-sectional, deprived of a longitudinal temporal facet (Veltri 2017); and (ii) most social datasets are rough clusters of variables, due to the restrictions on what can be asked in self-reported approaches.

Big data is leading to advances on both fronts, shifting from static snapshots to dynamic accounts and from rough aggregations to data with high (spatial and temporal) resolution (González-Bailón 2013; Kitchin 2014).

Understanding social complexity requires a large variety of computational approaches. For instance, the multiscale nature of social clusters comprises a countless diversity of organizational, temporal, and spatial dimensions, often simultaneously. Moreover, computation denotes several computer-based tools, as well as essential concepts and theories, ranging from information extraction algorithms to simulation models (Cioffi-Revilla 2014; Alvarez 2016).

Big data and its influence on geographic research have to be interpreted in the context of the computational and algorithmic shift that may progressively influence geographic research methods. To understand this shift, a distinction between two modeling approaches has to be addressed (Breiman 2001; Gentle et al. 2012; Tonidandel et al. 2016): (i) the data modeling approach, which assumes a stochastic model in which data and parameters follow the assumed model; and (ii) the algorithmic approach, which considers the data complex and unknown, and focuses on finding a function that imitates the mechanism of the data-generation process, reducing the statistical model to a function and dropping any assumptions about the data, e.g., distributional assumptions (Breiman 2001; Veltri 2017). Whereas the former estimates the parameter values from the data and then uses the model for inference and/or prediction, the latter moves from data models to the properties of algorithms. Here, what matters most is an increased emphasis on processes rather than structures.
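To make the distinction concrete, consider a minimal sketch in Python with scikit-learn, on synthetic data (an illustration, not part of the original text): the data modeling route assumes a linear stochastic model and interprets its estimated parameters, while the algorithmic route fits a black-box function and is judged solely by its predictive accuracy on unseen data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=500)  # synthetic process

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# (i) Data modeling: assume y = Xb + e, estimate b, and interpret the parameters
lm = LinearRegression().fit(X_tr, y_tr)
print("estimated coefficients:", lm.coef_)

# (ii) Algorithmic modeling: no distributional assumptions; the model is a function
# judged solely by how well it imitates the data-generation process on unseen data
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out R2:", r2_score(y_te, rf.predict(X_te)))
```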

Big data introduces the possibility of reframing the epistemology of science, presenting two potential paths underpinned by disparate philosophies: empiricism and data-driven science (Kitchin 2014).

1.1 Embracing “Big” Changes Beyond Traditional Methods of Data Collection

The label “big data” points to three features, also known as the 3Vs: (i) volume, regarded as the quantity of captured and stored data; (ii) velocity, the speed at which data can be collected; and (iii) variety, incorporating both structured (e.g., tables and relations) and unstructured (e.g., text and photographs) data (Kitchin 2014; Tonidandel et al. 2016). A fourth V, veracity, has been added but, as noted by Kitchin (2014), the term “big data” goes further and describes a type of analytic approach.

This is precisely the type of data created by simulations of immense complex systems, e.g., cities (Miller and Goodchild 2015), but a large share of it is provided by sensors and/or software that capture a wide range of social and environmental patterns and processes (Graham and Shelton 2013; Kitchin 2013). The sources of this spatial and temporal data include location-aware tools such as mobile phones, airborne sensors (e.g., unmanned aerial vehicles), and satellite remote sensors. Automated data is also generated as digital traces recorded on social media, among other online platforms (Miller 2010; Sui and Goodchild 2011; Townsend 2013).

Big data holds enormous potential for innovative statistics (Daas et al. 2015). Geolocation data retrieved from mobile phone records can be used to produce virtually instant statistics on tourism and daytime/nighttime population (de Jonge et al. 2012). Simultaneously, social media can serve as the basis for producing indicators of human mobility (Hawelka et al. 2014). Big data can also be used to replace or complement historical data sources, e.g., surveys, inquiries, and governmental data. For instance, inquiries about road usage may become obsolete once detailed traffic data obtained by road sensors becomes available (Struijs and Daas 2013).

Some big data sources, including social media, consist of found data that was never intentionally designed to support data analysis, i.e., they lack a clear structure, a well-defined target population, and/or proven quality. In this context, it is problematic to apply statistical methods based on sampling theory, i.e., traditional methods (Kitchin 2013; Daas and Puts 2014a). In particular, the unstructured facet of several big data sources makes it exponentially more difficult to (data)mine significant statistical information. In many of these sources, the explanation of the data and its relations with social phenomena remain a very fuzzy field of analysis (Daas and Puts 2014b; Tonidandel et al. 2016).

In a broader perspective, there is another issue regarding the human and technical capacity required for processing and analyzing big data. Contemporary data researchers are probably better prepared for this than traditional statisticians. Perhaps of utmost importance is the need for a distinct mind-set, because big data points toward a paradigm shift (Kitchin 2014), comprising an increased and improved use of modeling practices (Struijs and Daas 2013; Daas and Puts 2014a).

Before big data, random sampling was the main approach to dealing with information overload. This method works well, but it has its own fragilities: it only performs well if the sample is representative. Moreover, the sampling frame, i.e., the procedure for enumerating and selecting from populations, may be compromised if the enumeration is imperfect.

Sample data is also strongly tied to the purpose for which it was first collected. Since randomness is so important, it may be difficult to reanalyze the data for purposes other than those for which it was gathered (Mayer-Schonberger and Cukier 2014). By contrast, several of the new data sources rely not on samples but on populations. Yet these populations tend to be self-selected rather than sampled: for instance, all people owning smartphones, all people who joined “Flickr” or any other social network, or all vehicles traveling in the City of Lisbon between 17:00 and 21:00 on a specific day. In addition, tweets can be a striking source of information (Tsou et al. 2013; Hawelka et al. 2014), but only a fraction of them are actually geolocated. Although the specific characteristics of any of these groups may remain to be clarified, it is possible to generalize them to the populations they were drawn from (Encalada et al. 2017).

Nevertheless, some care should be taken, since some of the information people voluntarily provide may not reflect a “real measurement” of their activities (e.g., digital traces of commuting behaviours). Furthermore, selection biases can also occur in the information people volunteer about their surrounding environment. For example, OpenStreetMap (OSM) is frequently recognized as a popular Volunteered Geographic Information (VGI) venture. Many places around the world, including those in developed countries, have been mapped through OSM with a noteworthy degree of accuracy. Nonetheless, some places, such as tourist locations, are mapped faster and/or better than others of less interest to OSM users, such as slums (Haklay 2010).

Of course, biases also exist in official maps, because governments (even those of developing nations) frequently do not map unconventional settlements such as slums and/or do not regularly update the existing cartography due to budget restrictions. Yet the biases in VGI maps are probably more subtle. However, VGI platforms such as OSM provide tools for data cleaning and validation, so that users (acting as creators and co-creators) are able to remove as much fuzziness as is conceivable. Goodchild and Li (2012) discussed the challenges regarding the quality of VGI. They concluded that both traditional and non-traditional geographic information depend on multiple sources and on people's expertise to draw together a cohesive image of the landscape. For instance, surface information may be collected from photogrammetry, terrain measurements, historic sources, and crowdsourced data. Through this synthesis process, the resulting map might be more truthful than any of the original sources by itself, i.e., the whole is more than the sum of the parts.

1.2 The Analytic Background of (Big)Data Mining

Defenders of big data suggest that it generates thrilling prospects, though detractors consider it to be more propaganda than reality (Franks 2012; Savitz 2013). Furthermore, big data analysis can be criticized as a form of “dust-bowl empiricism” (Ulrich 2015; McAbee et al. 2017). Thus, there is a significant gap in our understanding of both the potential and the threats of big data (Tonidandel et al. 2016).

Much of geographic knowledge is based on formal theories, models, and equations that need to be processed in an informal manner. By contrast, data mining techniques require explicit representations, e.g., rules and hierarchies, that allow direct access without further processing (Miller 2010).

Geography has a history of tension between law-seeking (nomothetic) and description-seeking (idiographic) knowledge (Cresswell 2013). Wisely, physical geographers stay away from these debates, but the nomothetic–idiographic tension persists in human geography (Sui and DeLyser 2011; DeLyser and Sui 2012; Cresswell 2013). Perhaps unsurprisingly, geography has been criticized for unvalidated theories, results that cannot be reproduced, and a division between practice and science (Landis and Cortina 2015). Putka and Oswald (2015) indicate how geography could benefit from adopting the algorithmic modeling philosophy, claiming that the current data modeling philosophy limits the ability to predict results accurately and generates models that miss phenomena's key drivers and fail to incorporate uncertainty and complexity in a satisfactory manner.

Big data provides opportunities to detect genuine relational patterns (Dyche 2012). It is realistic to expect that big data can help clarify some of the residual variance. This incremental validity can arise from improved predictors, e.g., Internet footprints (Youyou et al. 2015).

As noted by Tonidandel et al. (2016), multiple regression is undoubtedly the most widely used statistical approach. Multiple regression assumes that the model being estimated is the correct one. Regrettably, the underlying theories are hardly ever sufficiently developed to include the most pertinent variables. Moreover, researchers often do not even know which variables are missing. Hence, they test a limited set of variables and risk omitting relevant ones, with implications for model accuracy and thus for the conclusions drawn from the data (Antonakis et al. 2010).

In opposition to this traditional methodology, the data analytic approach relies on multiple models or groups of models. While the former focuses on selecting the best model and accepts that it properly defines the data-generation process, the latter explores the many possible models that can be derived from the existing set of variables and combines the results through a multiplicity of techniques, e.g., bootstrap aggregation, support vector machines, neural networks (Seni and Elder 2010). The resulting group of models achieves better results, producing more accurate predictions (Markon and Chmielewski 2013; Kaplan and Chen 2014). Big data analytics rooted in machine learning can automatically detect patterns, create predictive models, and optimize outcomes, facilitating traditional forms of interpretation and theory building. However, in some cases, the new data analytic techniques may not improve on outcomes assessed using more traditional techniques (Schmidt-Atzert et al. 2011).
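As an illustrative sketch of one such combination technique, bootstrap aggregation, the following Python fragment (synthetic data; an assumption-laden illustration, not the original analysis) fits many models to resampled versions of the data and averages their predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)  # noisy nonlinear process

# Fit an ensemble of models, each on a bootstrap resample of the data
models = []
for _ in range(100):
    idx = rng.integers(0, len(X), size=len(X))  # sample with replacement
    models.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))

# The combined prediction is the average over the group of models,
# typically more stable and accurate than any single selected model
X_new = np.linspace(-3, 3, 50).reshape(-1, 1)
y_hat = np.mean([m.predict(X_new) for m in models], axis=0)
```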

Big data rests on a theoretical platform, and it could not be otherwise. Rather than equating big data with “dust-bowl empiricism”, it leads to what Kitchin (2014) describes as data-driven science. Moreover, as noted by Miller and Goodchild (2015), a data-driven geography may be emerging.

The notion of data-driven science holds that hypothesis generation and theory creation form an iterative process in which data is used inductively. “Dust-bowl empiricism” stops after the data mining process, whereas data-driven science continues by coupling inductive and deductive methods in an iterative process. Hence, it is possible to identify a new category of big data research that leads to the creation of new knowledge (Bakshy et al. 2014). Since the inductive process should not start in a theory-less void, preexisting knowledge guides the analytic engine to inform the knowledge discovery process, producing valuable conclusions instead of detecting any-and-all possible relations (Kitchin 2014).

In spatial analysis, the trend toward local statistics, e.g., geographically weighted regression (Fotheringham et al. 2002) and local indicators of spatial association (Anselin 1995), represents a compromise in which the general rules of nomothetic geography are allowed to vary across geographic space. Goodchild (2004) sees GIS as a mix of both the nomothetic and idiographic approaches, retained, respectively, in the software and algorithms, and within the (spatial) databases.

Although geographic generalizations are possible, space still matters. Both spatial dependence and heterogeneity generate a local context that shapes the processes occurring at the Earth's surface. Geography has held this belief for many years, and it has been strengthened by recent developments in complex systems theory, i.e., local interactions lead to emergent behaviors that cannot be understood when analyzed from a single (local or global) perspective. The co-created knowledge derived from the interactions between agents within a certain environment links the local and the global perspectives (Miller and Goodchild 2015).

Briefly, the move to data-driven geography does not represent a drastic break with geographic tradition, especially in applied research. There is long-standing confidence in the significance of idiographic knowledge per se and in its contribution to creating nomothetic knowledge. Even though this confidence is sometimes weak and contested, data-driven knowledge discovery offers the chance to advance the relationship between idiographic and nomothetic geography. Still, while complexity theory supports this idea, it also warns that data-driven knowledge discovery may have intrinsic limitations, i.e., emergent behaviour is unpredictable by definition.

2 Big Data from Social Media and Its Potential for Spatial Analysis of Urban Tourism Activities

2.1 Tracking Tourists’ Itineraries: Non-traditional Data Sources

Most statistical systems supporting the analysis and understanding of the tourism phenomenon in an urban context are based on three indicators: tourist arrivals, overnight stays, and occupancy of accommodation units (Heeley 2011). These indicators allow a generic and dynamic reading of the demand flows associated with city tourism. On the other hand, traditional statistical tools and methods can only measure the participation of tourists in “controlled sites” (e.g., museums, hotels). Both are, however, very limited when a more in-depth analysis of the phenomenon is sought at an intra-urban scale (Ashworth and Page 2011).

Understanding the complex, and often unequal, spatiality of tourist demand in urban space requires other methodologies, among which the information available online and in social networks has gained prominence. Being increasingly georeferenced, this information allows a more realistic and informed perception of tourist geography(ies) in urban destinations: places of greater/lesser attractiveness, mobility patterns, etc. Such information constitutes an advantageous and complementary option to official data (Goodchild and Li 2012), mainly due to its diversity, quantity, timeliness, and continuity.

Greater access to information, facilitated by new Information and Communication Technologies and a tourist profile that increasingly seeks online content, coupled with a growing predisposition to share information on social media, has allowed greater knowledge of the characteristics and behaviour of tourists (Buhalis and Law 2008; Tussyadiah 2012).

Crowdsourced data coming from social networks contributes to understanding the fruition/consumption of space within urban destinations. The geotagged photos published by users of the “Panoramio” and “Flickr” social networks during their visits to the city of Lisbon allow us to present a quantitative and geographic reading of urban tourism spatial production and consumption. In particular, the data extracted from these sources provides detailed information of great value for identifying places of concentration in dense and complex areas.

The emergence of Web 2.0 enabled the use of the Internet as a communication channel by generating a vast collection of digital platforms, such as those identified as social media. Social media can be defined as any digital platform where users can participate, create, and share content. Kaplan and Haenlein (2010) distinguished the following media: blogs, content communities, social networks, consumer review websites, instant-messaging sites, photo-sharing sites, etc. (e.g., Viajecomigo, TripAdvisor, Twitter, Facebook, Flickr, Panoramio).

The extensive use of the Internet has increased the influence of the content shared on these platforms on user behaviour and, more specifically, on the behaviour of tourists (MacKay and Vogt 2012; Tussyadiah 2012). Its impact has been significant in the tourism industry (Leung et al. 2013), with a growing tendency for tourists to share their experiences by publishing recommendations, reviews, photos, or videos about a destination, activity, or service, particularly on social networking sites (Buhalis and Law 2008).

ICT platforms, sensor networks, and wireless communication systems support data integration and exchange. We are experiencing a new era in which information is produced, in part, by users. This type of information is referred to as User-generated Content (UGC) or Crowdsourced Data (Kaplan and Haenlein 2010), Volunteered Geographic Information (VGI), the term most commonly used in geography, or Community-contributed Data (Goodchild 2007; Andrienko et al. 2009).

In the context of geographic information, online content accessible on these media platforms has become part of the set of available data sources, superseding the condition in which information was produced and distributed exclusively by official authorities (Sui et al. 2013).

UGC, as opposed to top-down methodologies, has turned individuals themselves into information generators with high spatial and temporal resolution, widening the range of alternatives for tracking their location (Sui and Goodchild 2011). Georeferenced information constitutes one of the most important types of UGC: geospatial technologies equipped social networks with positioning and mapping tools, which have led to a massive volume of georeferenced data.

When tourists use their mobile phones or credit cards, or access social networks, they leave behind large amounts of digital traces about their activities within a destination (Buhalis and Amaranggana 2014; Hawelka et al. 2014). These digital traces are often openly available. While traditional methods of geographic data collection relied on technically demanding, expensive, and complicated devices, non-traditional sources offer cost-effective information acquired through everyday devices such as mobile phones (Li et al. 2016).

A valuable feature of UGC grounded in social media is its continuous, almost real-time availability. In most cases, this means the information can be used to analyse current issues requiring continuous observation, shifting the analytical approach from a static snapshot to a more dynamic monitoring process (Sui and Goodchild 2011; Díaz et al. 2012).

This information can be used as a proxy to find patterns in the spatial distribution of visitors within a destination. For instance, several authors have performed analyses based on data extracted from social networks and other online platforms (e.g., Wikipedia, Wikitravel, and Foursquare) to identify points of tourist interest in different areas of the world (Tammet et al. 2013).

Despite current developments in storing, processing, and analyzing this information, it presents some challenges, such as the lack of assurance regarding its quality (Li et al. 2016), contrary to information from official sources, which is collected and documented through well-established procedures (Goodchild 2013). However, this type of data might play a useful role in driving exploratory analysis of a phenomenon (Goodchild and Li 2012).

All this ICT-generated innovation in data sources has emerged as an additional and complementary support for more traditional (data) sources. Therefore, non-traditional data should not be regarded as a substitute for official data or for data collected through traditional scientific methods, but as complementary (Goodchild and Li 2012).

The identification of tourist patterns/behaviours/preferences expressed through the digital imprints generated in the tourist destination fills a relevant gap in the knowledge about intra-destination mobility and, more generally, about the more informal and less documented fruition of tourist space.

Although the opportunities provided by online information shared by tourists are plentiful and prone to unveil geographical features of a place or region (through GIS and spatial analysis), this is a recently opened field (Zhou et al. 2015), still hindered by several constraints (data volume, velocity, variety, and reliability, among others) and by some suspicion of it as a novel research tool using new information sources in tourism.

The objectives of this chapter are to highlight consistent patterns of tourism production and consumption, in what may configure different tourist geographies of the city of Lisbon, perceived through the analysis of non-traditional data available on social network platforms, and, at the same time, to understand how this new paradigm of big data and mining techniques will affect the future of geographic analysis.

2.2 Urban Analytics: The City of Lisbon and Lisbon Metropolitan Area

The city of Lisbon is centrally located in the Lisbon Metropolitan Area (LMA). The LMA is composed of five official touristic regions, covering the municipalities of Oeiras, Amadora, Odivelas, Loures, Mafra, Almada, and Lisbon (Fig. 6.1).

Fig. 6.1 Tourism centralities in the Lisbon metropolitan area

In an attempt to demonstrate the novel opportunities of non-traditional data as complementary (data) sources for applied research, we refer to six indicators of tourism activity, based on data made available by one of the three Portuguese mobile phone operators, i.e., NOS®, about foreign users visiting and moving around the LMA and Portugal. Tourism density values (number of distinct tourists per km²), an index on a standardized scale of 1 (minimum) to 4 (maximum), show a predominant area within the LMA (Fig. 6.2), which corresponds to the Lisbon centrality. Almost all the municipalities in this region (having a tourism density equal to 4) are also in the top ten Portuguese municipalities with the highest densities: Lisbon (1st); Oeiras (3rd); Amadora (4th); Almada (7th); Odivelas (9th); and Cascais (10th).

Fig. 6.2 Tourism density by municipality, in the Lisbon metropolitan area (2017)

As specified by the Tourism Observatory of Lisbon (OTL), “City & short break” is considered the main motivation for visiting Lisbon (Observatório de Turismo de Lisboa (OTL) 2016). Moreover, the weekenders' index (Fig. 6.3) suggests a higher tourist presence on weekends. This index represents the ratio of the daily average number of tourists on the weekend to that during the week, in a month (a value greater than 100 means more tourists on the weekend than during the week). Despite its importance, Lisbon does not take the lead when compared to other municipalities in the LMA. The two leading municipalities are located on the south bank of the Tagus River: Sesimbra (156) and Almada (153), both with a long tradition of sun-and-beach tourism for short stays. This pattern is further supported by the municipalities ranked third and fourth, Cascais (148) and Oeiras (141), both situated on the north shore of the Tagus River and with characteristics similar to their southern counterparts. The only non-beach municipality in the top five is Lisbon (141), which embodies the idea of “City & short break” attractiveness. Yet these values are far from the national top ten, which denotes the importance of municipalities from the interior, mostly the northern interior, of mainland Portugal.

Fig. 6.3 Ratio of weekenders by municipality, in the Lisbon metropolitan area (2017)

With regard to tourism demand statistics, in 2016 the average stay of foreign guests in Lisbon was 2.6 nights. This value is similar for the LMA but lower than that for mainland Portugal (Instituto Nacional de Estatística (INE) 2017). Moreover, the statistics from the mobile phone operator show a slight increase in 2017, with an average stay of 3.2 nights (Fig. 6.4a). Lisbon's average number of foreign overnight stays is still one of the lowest in the LMA and in Portugal.

Fig. 6.4 Average number of nights spent (a) and night attraction (b) by municipality, in the Lisbon metropolitan area (2017)

The municipalities leading the national ranking are, as expected, located in the islands of the Azores and Madeira. Nonetheless, there are two LMA municipalities in the top ten, i.e., Odivelas (5.1) and Moita (4.6). Both have a vast immigrant population that receives extended visits from relatives and friends.

When considering the night attraction index (the percentage of tourists present at night relative to the total number of tourists, in a month), Lisbon clearly stands out in the metropolitan area (Fig. 6.4b), being also the second municipality in the national context. Once again, the municipalities with higher night attraction scores are those from the islands. However, in this case, Oporto (tenth) is also in the national top ten.

Finally, the lunch (Fig. 6.5a) and dinner (Fig. 6.5b) attraction indexes show different patterns. Both represent the percentage of tourists at lunchtime/dinnertime relative to the total number, in a month. Despite having a high lunch attraction (62%), Lisbon stays behind Sintra (67%), a World Heritage site, and Cascais (66%), a famous destination for sun-and-beach tourism. On the contrary, when looking at dinner attraction, Lisbon takes the lead in the metropolitan area. Broadly, the values of both indexes may suggest a commuting pattern within the region: it seems that some tourists visiting Lisbon travel to Sintra-Cascais (also known for its protected landscape area) to spend the day (as denoted by the lunchtime index) but come back at the end of it. These outcomes illustrate an interesting flow that should be further explored.

Fig. 6.5 Lunch (a) and dinner (b) attraction by municipality, in the Lisbon metropolitan area (2017)

2.3 Data Collection of Online Footprints from Social Networks

Recently, researchers have shifted their attention to social media as an alternative data source for collecting information about tourist activities. Here, we explore the value of geotagged data from two social networks for studying tourist spatial behaviour. We use data from “Panoramio” (acquired by Google in 2007 and shut down in late 2016) and “Flickr”. Both social networks provide access to online data through their Application Programming Interfaces (APIs). In addition to the users' photos (images), metadata such as user identification, timestamps, and geolocation is available as well.

According to the protocols of each API, the retrieval process is performed through an HTTP request, by setting some parameters (e.g., a bounding box overlapping an area, a valid data format, etc.) and data specifications (e.g., a time window, a set of keywords, etc.). Since there are restrictions on data retrieval (e.g., “Panoramio” allowed retrieving up to 500 photos per request for a given area), the requests are usually implemented following an automated scheme (i.e., a recursive algorithm) that controls the iterative process.
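A minimal sketch of such a recursive harvesting scheme is given below, written against Flickr's public flickr.photos.search method (the API key, the Lisbon bounding box, and the per-query cap that triggers the split are placeholders/assumptions, not the original harvesting code):

```python
import requests

API_URL = "https://api.flickr.com/services/rest/"
API_KEY = "YOUR_FLICKR_API_KEY"  # placeholder: a real key is required
MAX_PER_BOX = 4000  # assumed practical cap per query, motivating the recursive split

def count_photos(bbox):
    """Ask how many geotagged photos fall inside bbox = (min_lon, min_lat, max_lon, max_lat)."""
    params = {
        "method": "flickr.photos.search",
        "api_key": API_KEY,
        "bbox": ",".join(map(str, bbox)),
        "has_geo": 1,
        "extras": "geo,date_taken,owner_name",
        "per_page": 250,
        "format": "json",
        "nojsoncallback": 1,
    }
    r = requests.get(API_URL, params=params, timeout=30)
    return int(r.json()["photos"]["total"]), params

def harvest(bbox, store):
    """Recursively split the area until each cell is retrievable, then page through it."""
    total, params = count_photos(bbox)
    if total > MAX_PER_BOX:
        min_lon, min_lat, max_lon, max_lat = bbox
        mid_lon = (min_lon + max_lon) / 2
        mid_lat = (min_lat + max_lat) / 2
        for sub in [(min_lon, min_lat, mid_lon, mid_lat),
                    (mid_lon, min_lat, max_lon, mid_lat),
                    (min_lon, mid_lat, mid_lon, max_lat),
                    (mid_lon, mid_lat, max_lon, max_lat)]:
            harvest(sub, store)
        return
    page, pages = 1, 1
    while page <= pages:
        params["page"] = page
        data = requests.get(API_URL, params=params, timeout=30).json()["photos"]
        pages = data["pages"]
        store.extend(data["photo"])  # each record carries id, owner, latitude, longitude, datetaken
        page += 1

photos = []
harvest((-9.23, 38.69, -9.09, 38.80), photos)  # rough bounding box around Lisbon (illustrative)
```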

The study area was segmented into smaller areas and, for each unit, we downloaded the online data within its extent. All geotagged photos (and metadata) were stored in a database. The “Panoramio” database reached more than 70,000 records (including the image, photo description, user id, geolocation coordinates, timestamps, number of views, etc.). Similarly, the geotagged photos from “Flickr” amounted to more than 200,000 records.

The identification of photos uploaded by visitors was based on the photos' timestamps, following previous works (Girardin et al. 2008; García-Palomares et al. 2015; Encalada et al. 2017). Geotagged photos were classified as belonging to visitors only if the difference (in days) between the timestamps of the first and last photos uploaded by each user did not exceed the average stay of foreign tourists in the city. Since the average stay for the last year (2016) of the time series is close to 3 nights, only photos taken during a period of less than 4 days were catalogued as belonging to visitors.
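A minimal pandas sketch of this classification rule (the column names and sample records are assumptions for illustration):

```python
import pandas as pd

# Assumed schema: one row per photo, with a user identifier and the capture timestamp
photos = pd.DataFrame({
    "user_id": ["u1", "u1", "u2", "u2"],
    "taken": pd.to_datetime(["2016-05-01 10:00", "2016-05-03 18:00",
                             "2016-03-01 09:00", "2016-06-20 12:00"]),
})

MAX_STAY_DAYS = 4  # an average stay of ~3 nights implies a span strictly below 4 days

# Span (in days) between each user's first and last photo
span = photos.groupby("user_id")["taken"].agg(lambda t: (t.max() - t.min()).days)
visitor_ids = span[span < MAX_STAY_DAYS].index
visitor_photos = photos[photos["user_id"].isin(visitor_ids)]  # u1 kept; u2 dropped as a likely resident
```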

The final dataset comprises 19,578 photos from “Panoramio” and 73,314 from “Flickr”. The spatial distribution of the photos within the city of Lisbon is depicted in Figs. 6.6 and 6.7: the “Panoramio” dataset contains photos from 2008 to 2014, and the “Flickr” dataset from 2008 to 2016. These digital footprints belong to more than 15,000 users (from both social networks) considered city tourists.

Fig. 6.6 Visitors' geotagged photos from ‘Panoramio’, from 2008 to 2014

Fig. 6.7 Visitors' geotagged photos from ‘Flickr’, from 2008 to 2016

2.4 Addressing the Spatial Distribution of City Tourists

In this section, we refer to some (traditional) methods of exploratory analysis for discovering significant patterns in spatial data. A brief analytical scheme, ranging from global to local metrics, is presented. Our selection of spatial analysis techniques relies mainly on their wide applicability, since we aim to reach a broad audience interested in this type of analysis. Note that we keep the spatial analysis tools at the simplest level; for more details, readers should refer to García-Palomares et al. (2015) and Encalada et al. (2017).

When analyzing spatial data, the starting point is to assess spatial autocorrelation and determine whether the global distribution of the data is dispersed, clustered, or random. This can be done using global spatial autocorrelation indexes (e.g., the Nearest Neighbor Index, Global Moran's I, the Global Getis-Ord index).
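As a minimal sketch with the PySAL ecosystem (libpysal/esda), assuming the photo counts have been aggregated to polygonal cells stored in a hypothetical file photo_cells.gpkg with a photo_count column:

```python
import geopandas as gpd
from libpysal.weights import Queen
from esda.moran import Moran

cells = gpd.read_file("photo_cells.gpkg")  # hypothetical layer: polygons with photo counts
w = Queen.from_dataframe(cells)            # contiguity-based spatial weights
w.transform = "r"                          # row-standardize the weights

mi = Moran(cells["photo_count"], w, permutations=999)
print(mi.I, mi.p_sim)  # a positive, significant I indicates global clustering
```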

Global statistics cannot measure how spatial dependence varies from place to place. To capture the heterogeneity of spatial dependence, statistics must be applied at the local scale. Although the geotagged photos show a rough concentration in some areas (as depicted in Figs. 6.6 and 6.7), it is necessary to test whether the observed pattern is statistically significant and thus supported by an underlying spatial process. Local indicators such as the Local Moran Index and Getis-Ord Gi* allow the disclosure of statistically significant spatial clusters (i.e., places of tourist concentration). The Local Moran Index indicates spatial concentrations of similar values as well as spatial outliers. To perform the analysis, it is necessary to define the neighborhood (i.e., a distance threshold) and the nature of the spatial relationship between observations (i.e., the notion of proximity between observations, achieved in most cases by creating a spatial weights matrix).
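Continuing the sketch above, a Local Moran analysis with the parameters adopted later in this chapter (a 150 m threshold and inverse-distance weighting) could look as follows; the column names remain assumptions:

```python
from libpysal.weights import DistanceBand
from esda.moran import Moran_Local

# Inverse-distance weights within a 150 m radius (binary=False with alpha=-1 gives 1/d weights);
# cell centroids must be in a projected CRS so that distances are expressed in meters
coords = [(geom.x, geom.y) for geom in cells.geometry.centroid]
w_local = DistanceBand(coords, threshold=150, binary=False, alpha=-1.0, silence_warnings=True)

lisa = Moran_Local(cells["photo_count"], w_local, permutations=999)
cells["cluster"] = lisa.q           # quadrant: 1=High-High, 2=Low-High, 3=Low-Low, 4=High-Low
cells["significant"] = lisa.p_sim < 0.05
hot_spots = cells[(cells["cluster"] == 1) & cells["significant"]]  # significant High-High cells
```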

Another technique frequently used for analyzing and visualizing point features is Kernel Density Estimation (KDE). It calculates a magnitude per unit area from a set of points (using a kernel function), producing a smooth density surface over space. The basis of these methods is Tobler's first law, which states that everything is related to everything else, but near things are more related than distant things.
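A self-contained sketch of the idea, using the quartic (biweight) kernel with a fixed search radius over a regular grid (numpy only; the 150 m radius mirrors the value used later in this chapter):

```python
import numpy as np

def kernel_density(points, xmin, ymin, xmax, ymax, cell=25.0, radius=150.0):
    """Quartic-kernel density surface (points per m^2) on a regular grid."""
    xs = np.arange(xmin, xmax, cell)
    ys = np.arange(ymin, ymax, cell)
    gx, gy = np.meshgrid(xs, ys)
    density = np.zeros_like(gx)
    for px, py in points:
        d2 = (gx - px) ** 2 + (gy - py) ** 2
        inside = d2 < radius ** 2
        # Quartic kernel: K(d) = 3/(pi*r^2) * (1 - d^2/r^2)^2 for d < r, 0 otherwise
        density[inside] += 3.0 / (np.pi * radius ** 2) * (1.0 - d2[inside] / radius ** 2) ** 2
    return xs, ys, density

# Usage with projected coordinates in meters (e.g., photo locations)
pts = [(1000.0, 1200.0), (1030.0, 1180.0), (2500.0, 400.0)]
xs, ys, dens = kernel_density(pts, 0, 0, 3000, 2000)
```

A larger radius flattens and widens each kernel, revealing the more global tendency; a smaller radius sharpens local peaks.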

Furthermore, spatial patterns change over time. While the former techniques handle the spatial context, other methods (e.g., Emerging Hot Spot Analysis) support the analysis of both spatial and temporal patterns emerging from a set of observations. Thus, in addition to discovering significant spatial clusters, the seasonal pattern can be obtained as well (i.e., whether the hot spots are consecutive, sporadic, etc.).

To assess the geographical patterns of urban tourists within Lisbon, two methods were used: Cluster and Outlier analysis (based on the Local Moran Index) and Kernel Density Estimation. For the Cluster and Outlier analysis, the data was aggregated to a continuous hexagonal surface. Assuming that not all photos within the study area are spatially related, a threshold for the neighborhood radius of influence was set to run the analysis (a threshold distance of 150 m). Inverse distance was chosen to conceptualize the spatial relationship; thus, the influence of neighboring features decreases as the distance between them increases. Moreover, since the kernel function depends on a distance parameter (e.g., increasing the search radius results in a broader, lower kernel, showing the spatial tendency at a more global scale), a search radius of 150 m was also defined, mirroring the Local Moran Index parameters. The outcomes are presented in the next section.

2.5 Mapping the Spatial Distribution of City Tourists

The visual representation of the geotagged photos shows a tendency toward clustering, mainly in the areas with greater touristic appeal. Furthermore, the Local Moran Index outlines a more accurate picture by identifying the city's tourist hot spots (i.e., statistically significant places of tourist concentration). Thus, the most relevant touristic sites are discriminated from the overall sample of points previously mapped and depicted in Figs. 6.6 and 6.7.

As expected, clusters are located near the well-known city tourist attractions (Figs. 6.8 and 6.9). In general, places of interest such as viewpoints, squares, monumental architecture, and other cultural and recreational attractions act as foci of spatial clusters. The historic center clearly stands out from other touristic areas (in Fig. 6.8: “Eduardo VII” park (3); “Marquês de Pombal” monument (4); “Rossio” square (5); “Comércio” square (6); “São Jorge” Castle (7)). A smaller number of significant clusters were uncovered over “Belém”, to the southwest (in Fig. 6.8: “Belém” Tower (1); “Padrão dos Descobrimentos” monument (2)), and in “Parque das Nações”, to the northeast (in Fig. 6.8: Lisbon Oceanarium (8)).

Fig. 6.8 Clusters from the ‘Panoramio’ dataset

The majority of significant clusters belong to the High-High category, corresponding to places with a large number of tourist photos surrounded by similarly high counts. By contrast, there are a few atypical clusters (Low-High) located in the areas surrounding the identified hot spots. These outliers expose places that are less visited compared to the visitor presence in their neighborhood.

The main difference between the two maps (Figs. 6.8 and 6.9) comes from the fact that “Panoramio” users were more devoted to uploading photographs illustrating places (e.g., open-space areas). Flickr photos, instead, are more “relaxed” and depict memories of any topic (e.g., social events, daily activities, people, etc.). For instance, the Benfica stadium (3) corresponds to a significant cluster based on visitor photos from “Flickr” (Fig. 6.9), but not from “Panoramio”.

Fig. 6.9 Clusters from the ‘Flickr’ dataset

From the analysis of both density maps (Fig. 6.10), visitor activity shows higher density at sites within the Tourism Micro-centralities (i.e., areas of major touristic interest identified by the City Tourism Office). Three areas are highlighted: “Belém” (southwest), the Historic Center, and “Parque das Nações” (northeast). By contrast, a few sites with low densities in the inner part of the city reveal an irregular and lower (yet significant) attention from the city's visitors.

Fig. 6.10 Kernel density (photos/m²) of geotagged photos from ‘Panoramio’ (top) and ‘Flickr’ (bottom)

The heat maps effectively summarize some of the visitors' places of interest. Although the overall kernel density for “Panoramio” is slightly lower than the results from “Flickr”, the hot spot areas match in both cases: for instance, “Comércio” square and “São Jorge” Castle, “Jerónimos” Monastery and “Padrão dos Descobrimentos” monument (southwest), and the Lisbon Oceanarium. Less highlighted places can be identified as well, such as “Rossio” and “Restauradores” squares, and the “Marquês de Pombal” monument.

The cross-reading of the maps resulting from the KDE and cluster analysis points out that tourist attractions are, in fact, the foci of visitor clusters. The spatial extent of areas with intensive tourist presence follows a pattern: while the tourist attractions show the highest densities, intensity decreases gradually as the distance from these cores increases. In many cases, the spatial extent follows the physical shape of the tourist attractions (e.g., squares, pedestrian streets), expanding across those areas and beyond their perimeter into other nearby areas.

Empirical evidence suggests that urban tourism studies may benefit from geotagged digital data. These outcomes demonstrate that touristic areas can be properly identified and differentiated from others with lesser or no tourism-related activity and visitation.

3 Challenges Regarding the New Paradigm Shift in Geography Research

Spatial modeling, and Geography in a broad sense, have shifted from a data-scarce to a data-rich environment. The critical change is not the data volume, but rather the variety and the velocity at which georeferenced data can be collected and stored. Data-driven geography is (re)emerging from a massive flow of georeferenced data coming from sensors and people.

Data-driven geography raises issues that are, in fact, long-standing problems debated within the research community: for instance, dealing with large data volumes, the problem of samples versus populations, data fuzziness, and the frictions between idiographic and nomothetic approaches. Yet the conviction that location matters (i.e., spatial context) is intrinsic to geography and serves as a strong motivation to produce refined methods in spatial statistics, time geography, and GIScience.

Big data has huge potential to feed both spatial analysis and modeling and geographic knowledge discovery. Nonetheless, some issues remain, e.g., data validation, non-causal relationships leading to incorrect conclusions, and the creation of understandable data-driven models. The impact of big data and data-driven geography on society remains a current agenda (Mayer-Schonberger and Cukier 2014). The main concern is privacy, not only because of the people involved but also because of the potential repercussions that may halt data-driven research.

As big data becomes more and more rooted in socio-spatial decisions, processes, and institutions, the signifier-signified connection may become increasingly fuzzy. As we place ever more trust in big data, and in the algorithms used to produce and analyse it, we are also more likely to lose sight of the big picture that such data represents, while distorting the ontological-epistemological boundaries (González-Bailón 2013). This leads to a situation where it is normal to take decisions upon complex data, processed by black-boxed algorithms running in closed software (Graham 2013).

Still, big data is not free of dangers, and there is a potential risk of oversimplifying human agency and the data production frame (Boyd and Crawford 2012; Tinati et al. 2014; Schroeder 2014). Experience tells us that advances in ICTs, far from being inclusive, often amplify the socio-spatial unevenness of both representation and participation, as evidenced in a variety of online datasets (Graham 2011; Haklay 2013).

Another ethical risk of big data derives from what is sometimes designated machine bias (Angwin et al. 2016). Whereas data are often assumed to be objective, big data and the algorithms surrounding it may not be. Muñoz and colleagues (2016) show prominent examples of how “bad” data (e.g., badly selected, incomplete, incorrect, or outdated) can lead to discriminatory (biased) outcomes. With that in mind, some caution is warranted, because artificial intelligence can be just as biased as human beings, i.e., discrimination can exist in machine learning.

Indeed, extensive improvements in ICTs have augmented the multimedia narratives about the geographical representation of places, with important implications for the future of geography. Geographers integrating big data with current research paradigms have already transformed and promoted the study of geographical systems and, in the process, have developed new notions of space. This is an opportunity for new research techniques in both qualitative and quantitative contexts. Big data and data analytics improve the understanding of the consumption of urban space (public or private), leaving physical and digital space, respectively fixed and fluid, while both overlap and coexist, each shaped by the other and by its users.