17.1 Challenges in Spatial Analytics

As a set of quantitative and computational approaches for analyzing geospatial data, spatial analytics lies at the core of Geographic Information Science (GIScience), supporting exploration, knowledge discovery, and decision making in the spatial realm. Golledge (2009) identified spatial analysis as geographers' unique contribution to the scientific community, defining it as the set of methods developed exclusively for analyzing location-based information. Location-based data require specialized analytics to handle spatial dependence, scale dependence, and the ecological fallacy, which are not sufficiently accounted for by conventional statistical methods. In the past decades, as spatial theory and computing technology advanced, spatial analysis expanded considerably to cover spatial statistics (for example, exploratory spatial data analysis and spatial regression), spatial simulation (such as agent-based modeling and microsimulation), spatial optimization (Murray, 2021), and data-driven techniques, such as data mining and artificial intelligence (Li, 2020).

Despite this remarkable breadth, spatial analytics still faces substantial challenges. Goodchild (2009) identified several notable issues. From the technology perspective, the migration of spatial analytical functions to the Web necessitates new business models. Such models would ideally handle server-client communication and interoperability, manage data innovatively for online parallel processing services, and promote transparency in the spatial analysis modules available online. From the science perspective, a (re)formulation of GIScience based on how spatial analytics are used in scientific and practical problem solving would be beneficial. Over a decade later, we ask: how has the research landscape of spatial analysis changed, how well have Goodchild's challenges been addressed, and what new challenges are emerging?

The last 10 years have witnessed revolutionary advances in technology. Although the term 'cloud computing' was new a decade ago, cloud platforms have become prevalent today for storing, computing on, and analyzing geospatial data across a wide range of applications (Li et al., 2016). Instead of maintaining dedicated servers, geographic information system (GIS) users and developers increasingly rely on cloud infrastructure built on highly reliable virtualized machines capable of elastic computing to meet the varied needs of end users. For example, Google Earth Engine, Google's cloud platform hosting multiple decades of remote sensing imagery, offers the public rapid access to massive geospatial data and planetary-scale spatial analytics (Gorelick et al., 2017; Yang et al., 2018). The emergence of cyberinfrastructure and CyberGIS has also revolutionized the landscape of spatial analysis, enabling collaborative data sharing, analytics, and decision-making (Anselin & Rey, 2012; Li et al., 2016, 2019a, 2019b; Wang, 2010; Yang et al., 2017).
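To make the idea of planetary-scale, cloud-hosted analysis concrete, the sketch below shows how a user might request a server-side composite through the Google Earth Engine Python API; the dataset identifier, date range, and bounding box are illustrative choices rather than anything prescribed in this chapter.

```python
# A minimal sketch (illustrative only): accessing cloud-hosted imagery with the
# Google Earth Engine Python API; the dataset ID, dates, and region are examples.
import ee

ee.Initialize()  # assumes prior authentication with an Earth Engine account

# Landsat 8 surface reflectance scenes, filtered to one year over a sample region.
region = ee.Geometry.Rectangle([-112.3, 33.2, -111.6, 33.8])  # hypothetical bounding box
collection = (
    ee.ImageCollection("LANDSAT/LC08/C02/T1_L2")
    .filterDate("2020-01-01", "2020-12-31")
    .filterBounds(region)
)

# The composite is computed server-side on Google's infrastructure;
# no imagery is downloaded to the local machine.
composite = collection.median().clip(region)
print(collection.size().getInfo(), "scenes contributed to the composite")
```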

Despite these advances, spatial analytics still faces both long-standing and newly emerging challenges. Here we present a few examples of these challenges from the computational and data science perspectives.

17.1.1 The Size Challenge of Big Data

Big data have changed nearly every aspect of our lives and the way we conduct science. Earth observation and remote sensing images, imagery from unmanned aerial vehicles (UAVs), and georeferenced data from social media platforms and Internet of Things (IoT) sensors have made geospatial data available at unprecedented spatial and temporal coverage, resolution, and collection frequency (Li et al., 2020). Handling these data at high throughput and in real time presents considerable challenges for traditional analytical methods designed for processing small, clean datasets (Li et al., 2022). Spatial statistical methods, for instance, often require abstracting raw data into tabular point data to identify clustering patterns or to relate numerical attributes through linear regression. Such methods reach their limits when analyzing big data, which are, by definition, large, noisy, diverse, and complex. Although redesigning existing statistical methods to handle big data has been attempted (Laura et al., 2015; Li et al., 2019a, 2019b), many widely used spatial statistical software packages, such as PySAL (Python Spatial Analysis Library) (Rey et al., 2015) and Geographically Weighted Regression (GWR) (Oshan et al., 2019), are still deployed in desktop computing environments and do not exploit advanced computing hardware such as Graphics Processing Units (GPUs), likely because the focus of innovation remains on methodology rather than computational performance. In addition, sampling approaches are often introduced to handle big data. However, for a large dataset with an unknown distribution, it is difficult to guarantee that conventional sampling does not introduce bias, for example when sub-setting training and test sets.
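As one illustration of the sampling concern raised above, the sketch below shows a spatially blocked train/test split, a common way to reduce the bias that a purely random split can introduce when observations are spatially dependent; the grid size and synthetic point data are assumptions made purely for demonstration.

```python
# A minimal sketch of spatially blocked train/test splitting, one common way to
# reduce the bias a naive random split can introduce when observations are
# spatially dependent. Grid size and synthetic data are illustrative.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
coords = rng.uniform(0, 100, size=(n, 2))   # synthetic point locations (x, y)

# Assign each point to a coarse spatial block (a 10 x 10 grid of 10-unit cells).
cell = 10.0
block_id = (coords[:, 0] // cell).astype(int) * 10 + (coords[:, 1] // cell).astype(int)

# Hold out whole blocks, so test points are not spatial neighbors of training points.
blocks = np.unique(block_id)
test_blocks = rng.choice(blocks, size=len(blocks) // 5, replace=False)
test_mask = np.isin(block_id, test_blocks)

train_idx, test_idx = np.where(~test_mask)[0], np.where(test_mask)[0]
print(f"{len(train_idx)} training points, {len(test_idx)} test points")
```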

17.1.2 Navigating Through the Messiness of Big Data

Conventionally, big data are messy data. At the rates data are generated today, the diversity of collection methods makes timely quality control difficult. For example, very fast sampling of a phenomenon that occurs only sporadically can produce many empty records. Data reduction can also introduce problems: when stacking large numbers of raster images over time and computing a mean or median response for co-located pixels, one can end up with a median image that is too dark in areas of dense cloud cover. Resampling introduces further inaccuracies when images are not uniformly registered and their pixels are misaligned. Such issues are easier to detect in small datasets than in large ones. Hence, the ability to navigate through big, complex data becomes a new challenge that calls for innovative techniques designed for big data analytics.

Compiling and delivering the 2020 Census alone cost the U.S. Census Bureau over $14 billion (GAO, 2021); this is one example of high-quality, official data managed by governments. Many other big datasets, however, are created on social media and crowdsourcing platforms, such as Twitter, which are increasingly used for research because of their broad spatial coverage, richness of content, and low collection cost. Data from these platforms inevitably contain a substantial amount of noise, owing in part to their openness, which allows anyone to say anything at any time. In Bayesian statistics, for example, choosing an appropriate prior distribution is often necessary for the estimated posterior distribution to match reality. Noisy data impede accurate estimation of the prior, and the resulting errors propagate to later stages of the inference process and lead to imprecise results.
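The compositing pitfall described above can be made concrete with a toy example: the sketch below contrasts a naive median composite of a contaminated raster stack with one computed after masking the contaminated pixels. The arrays, contamination rate, and reflectance values are entirely synthetic.

```python
# A toy illustration of the compositing pitfall: taking a median across a raster
# stack without masking contaminated observations biases the result, while
# masking those pixels first does not. All arrays are synthetic.
import numpy as np

rng = np.random.default_rng(0)
t, h, w = 24, 50, 50                      # 24 image dates, 50 x 50 pixels
truth = np.full((h, w), 0.30)             # "true" surface reflectance
stack = truth + rng.normal(0, 0.02, size=(t, h, w))

# Simulate persistent cloud shadow (dark pixels) over part of the time series.
shadowed = rng.random((t, h, w)) < 0.5    # half the observations contaminated
stack_dark = np.where(shadowed, 0.05, stack)

naive_median = np.median(stack_dark, axis=0)        # biased dark
masked = np.where(shadowed, np.nan, stack_dark)
masked_median = np.nanmedian(masked, axis=0)        # close to the true reflectance

print("naive median :", round(float(naive_median.mean()), 3))
print("masked median:", round(float(masked_median.mean()), 3))
```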

17.1.3 Hypothesis Testing Versus Knowledge Mining

Besides relying on well-processed data, traditional spatial analytical approaches also require an accurate understanding of, and prior knowledge about, the underlying process. For instance, in agent-based modeling, heuristic rules must be defined to guide how an agent moves in space and interacts with the environment and with other agents (Li et al., 2020). When applying regression analysis, one must explicitly define both the independent variables (X) and the dependent variable (y) when building the model, which presumes knowledge of how X affects y. The goal of the analysis is to explain whether and how these independent variables (for example, income or climate) affect the dependent variable (such as housing price) in a geographical region. To incorporate the geographically varying effects that result from spatial heterogeneity, local modeling, such as GWR, is introduced to capture how these effects vary across space. These analyses amount, in general, to testing a hypothesis or quantifying the strength of the effect between X and y in a predefined model. Whereas such methods are effective at identifying expected patterns, their ability to discover or learn unknown relations is weak.
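As a concrete illustration of local modeling, the sketch below fits a geographically weighted regression with the open-source mgwr package (Oshan et al., 2019) to synthetic data in which the effect of income on housing price drifts across space; the variable names and data-generating process are assumptions for demonstration only.

```python
# A minimal sketch of local modeling with geographically weighted regression,
# using the mgwr package on synthetic data; the income/price variables mirror
# the example in the text and are purely illustrative.
import numpy as np
from mgwr.gwr import GWR
from mgwr.sel_bw import Sel_BW

rng = np.random.default_rng(1)
n = 200
xy = rng.uniform(0, 10, size=(n, 2))              # synthetic point locations
coords = list(zip(xy[:, 0], xy[:, 1]))
income = rng.normal(50, 10, size=(n, 1))          # independent variable X
# The effect of income on price drifts across space (spatial heterogeneity).
local_beta = 1.0 + 0.2 * xy[:, [0]]
price = 100 + local_beta * income + rng.normal(0, 5, size=(n, 1))

bw = Sel_BW(coords, price, income).search()       # adaptive bandwidth selection
results = GWR(coords, price, income, bw).fit()

# One coefficient per location: column 0 is the intercept, column 1 is income.
print("selected bandwidth:", bw)
print("local income coefficients span",
      round(float(results.params[:, 1].min()), 2), "to",
      round(float(results.params[:, 1].max()), 2))
```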

Confronting these challenges requires new spatial analytical methods capable of mining new knowledge from large datasets that contain unanticipated or previously unknown patterns, while remaining tolerant of noise. Such methods also would ideally learn to model the process itself rather than relying on definitions drawn from prior knowledge. GeoAI has emerged as a new arena for addressing these challenges.

17.2 GeoAI: A New Form of Spatial Analytics

GeoAI, or geospatial artificial intelligence, is a transdisciplinary research area that integrates cutting-edge AI to solve geospatial problems (Li, 2020). In the past decade, remarkable progress has been made in AI, particularly in machine learning and deep learning. The convolutional neural network (CNN) framework is a milestone development (Reichstein et al., 2019). The CNN adopts the concept of the artificial neural network (ANN), a computer model that mimics the biological neural network of the human brain, and brings transformative changes through the introduction of convolution modules (Fukushima, 2007; Li, 2021; Li et al., 2012; Zhang, 1988). These modules conduct information extraction (also known as feature extraction, with each extracted feature playing the role of an independent variable X in a regression) directly from the raw data. CNN-based techniques can therefore act on raw data and uncover hidden patterns through deep mining and iterative learning. This kind of data-driven analysis relaxes the requirement in traditional spatial analytics to assume predefined rules or relationships between the data (input) and the objective (output), thereby supporting discovery and pattern recognition directly from data. This is also known as data-driven discovery (Miller & Goodchild, 2015; Yuan et al., 2004).
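To illustrate how convolution modules extract features directly from raw data, the sketch below defines a minimal CNN in PyTorch that maps raw multi-band image patches to class scores; the layer sizes, the four-band input, and the five output classes are illustrative assumptions, not specifications from this chapter.

```python
# A minimal sketch of convolution-based feature extraction from raw raster
# patches, written with PyTorch; layer sizes and band counts are illustrative.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, in_bands: int = 4, n_classes: int = 5):
        super().__init__()
        # Convolution layers learn features directly from raw pixels,
        # replacing hand-crafted independent variables.
        self.features = nn.Sequential(
            nn.Conv2d(in_bands, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.features(x).flatten(1)   # one learned feature vector per patch
        return self.classifier(feats)

patches = torch.randn(8, 4, 64, 64)           # a batch of 8 raw 4-band 64x64 patches
logits = TinyCNN()(patches)
print(logits.shape)                           # torch.Size([8, 5])
```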

Another breakthrough in the design of CNNs is that each convolution layer (Albawi et al., 2017) performs local operations on the data, making parallel computation possible. This design lifts the computational constraint of traditional ANNs, whose fully connected layers create high dependency among the artificial neurons. The recent development of high-speed GPUs, which contain from a few hundred to several thousand micro-processing units, allows high-performance training of CNNs, even those with complex structures, across computing units running in parallel. This also empowers a deep learning model to process big data, furthering its ability to detect new patterns, extract useful information, and create high-quality foundational datasets that aid the elucidation of important scientific questions (Arundel et al., 2020).
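The sketch below illustrates, again in PyTorch, how such training is placed on a GPU when one is available so that each batch of patches is processed in parallel across the device's cores; the model architecture, batch size, and synthetic data are illustrative assumptions.

```python
# A minimal sketch of moving convolutional training onto a GPU when available;
# the model, batch size, and synthetic data are illustrative only.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(                        # small convolutional model
    nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 5),
).to(device)                                  # parameters live in GPU memory if present

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Each batch of image patches is processed in parallel on the GPU's cores.
for step in range(10):
    x = torch.randn(32, 4, 64, 64, device=device)   # synthetic 4-band patches
    y = torch.randint(0, 5, (32,), device=device)    # synthetic labels
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
print("final loss:", float(loss))
```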

Moreover, deep learning models are arguably better at handling noise in training labels than traditional statistical methods (Rolnick et al., 2017). Because many such models are designed to learn complex relations, they tend to overfit the training data. Overfitting occurs when a model fits the training data too closely, including its noise, and an overfit model performs poorly on unseen data. One solution is to add noise to the training data so that the model fits it less perfectly, reducing the likelihood of overfitting and increasing predictive accuracy. In addition, strategies such as increasing the batch size (exposing the model to more samples per parameter update during iterative learning), lowering the learning rate (allowing a more thorough search for optimal solutions), and providing enough correctly labeled samples enable a deep learning model to tackle even extremely noisy data (Rolnick et al., 2017). Although noise in big data is inevitable, the way deep learning is designed and handles data makes it more robust to noise than traditional spatial analytical approaches. On the other hand, deep learning requires thousands to billions of training examples to develop abstractions that the human brain can intuit from an explicit, verbal definition (Marcus, 2018). Interpretability of the results and extension beyond the scope of the training data are further limitations of deep learning systems (Reichstein et al., 2019) that must be overcome.
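The noise-tolerance strategies listed above can be expressed as ordinary training-configuration choices; the short PyTorch sketch below applies a larger batch size and a lower learning rate to synthetic, partly mislabeled data, with all specific values chosen purely for illustration.

```python
# A short sketch of the noise-tolerance strategies described above, expressed
# as training-configuration choices; all values and data are illustrative.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic 4-band patches with 10% of the labels deliberately corrupted,
# mimicking noisy training data.
gen = torch.Generator().manual_seed(0)
patches = torch.randn(5_000, 4, 32, 32, generator=gen)
labels = torch.randint(0, 5, (5_000,), generator=gen)
noisy = torch.rand(5_000, generator=gen) < 0.10
labels[noisy] = torch.randint(0, 5, (int(noisy.sum()),), generator=gen)

# A larger batch exposes each parameter update to more samples, averaging out
# the influence of individual mislabeled examples.
loader = DataLoader(TensorDataset(patches, labels), batch_size=512, shuffle=True)

model = torch.nn.Sequential(
    torch.nn.Conv2d(4, 8, 3, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(8, 5))
# A lower learning rate makes the parameter search more gradual, reducing the
# chance of fitting individual noisy labels.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
loss_fn = torch.nn.CrossEntropyLoss()

for x, y in loader:                      # one pass over the noisy training set
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```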

17.3 Concluding Remarks

As a new form of spatial analytics, GeoAI is exciting because of its outstanding performance in big data analytics, especially in classification, prediction, and pattern recognition. However, the GeoAI domain is still in its infancy and more research is needed for it to become a well-established scientific field. The role of GeoAI in (re)formulating GIScience also needs to be more clearly defined. This need echoes insights shared by Goodchild (2009) in terms of the challenges of spatial analytics in general. We know that the complexity and black-box nature of GeoAI models render the model’s reasoning process more difficult to explain than that of traditional spatial analytical approaches (Goodchild & Li, 2021). But this also offers an opportunity to create an even more powerful analytical framework by combining GeoAI and traditional methods. GeoAI can serve as a data pre-processing module that directly interacts with raw big data to achieve high-yield analysis and data filtering (Li et al., 2022).

For instance, a GeoAI-based analytical framework can achieve near real-time processing of satellite remote sensing imagery to create a national- to global-scale database characterizing natural and human-made features on Earth (Li & Hsu, 2020). Such a dataset, for which scientists and researchers have waited decades, can be integrated into downstream statistical models to understand crucial environmental and climate change problems (Reichstein et al., 2019). The data and models may jointly contribute to a convergent research agenda for spatial analytics.

Clearly, the development and refinement of existing and future spatial analytics (GeoAI and beyond) should consider fundamental geospatial principles, such as location, scale, spatial autocorrelation, spatial heterogeneity, and geographic similarity. As data and systems become more open, they are less likely to follow fundamental principles and best practices. This concern resembles those expressed by scholars during the early years of GIS development: whether users would apply the correct projection for the variable studied, correct their statistical analyses for locational bias, or account for error when combining variables across spatial themes.

Whereas some elements of these potential problems are now handled inherently by software systems, other problems persist or cannot yet be envisioned. Like GIS, GeoAI and subsequent technologies would ideally balance the accessibility of the approach with its applicability, and the enforcement of principles with the flexibility of application. This is the grand challenge for the spatial science community: not only to create and disseminate new tools toward broader and more ethical use, but, more importantly, to leverage these tools to improve the analysis of spatial information to address critical global, regional, and local problems.