
1 Introduction

Spatial data analysis and GIS are instrumental components for examining the spatial dimension of regional science. As the role of geographical space has been increasingly recognized in science, social science, and the humanities, driven in part by the explosive growth of GIS, spatial analysis has become progressively embedded within statistical analysis and modeling. GIS has enabled ever greater access to rapidly expanding quantities of digital spatial data. In a seminal paper, Anselin and Getis (1992) made a distinction between confirmatory and exploratory data analysis where, although the edges of both were blurred, the former was largely deductive and theory driven while EDA was inductive and data driven. In reality this distinction should be questioned, for it suggests that the EDA process begins with little prior theoretical understanding of the problem or of the datasets and is essentially a ‘fishing expedition’ through available databases before the real deductive, theory-driven analysis begins. Few reputable studies proceed in such a fashion. In consigning EDA to a predominantly data driven, atheoretical approach using largely descriptive techniques, the real power and insight that EDA provides is relegated to a lesser role than it deserves. Goodchild (2010) alludes to this point in noting that the data driven approach and the search for pattern were often viewed as independent of any theoretical framework, which to some degree contributed to the social-theoretic critique of GIS as being essentially unconcerned with theory (Pickles 1995).

Importantly, however, Anselin and Getis recognized that to serve the spatial needs of regional science, an integration of spatial analysis and GIS based on computationally intensive approaches and visualizations was required. In recent decades, GIS and cyberinfrastructure have brought about a revolution in the availability of spatial data. In addition, the spatial analytical power of GIS and geospatial technologies has contributed to a substantial restructuring of regional science. Many analog map collections have been converted to coordinate-based digital form, and digitally-born spatially referenced data are now available in ever increasing quantities and consumed through spatial data portals and across the Internet. GIS has changed the spatial analysis landscape in many other ways as well: “The geospatial world of today is clearly a much broader domain of data, tools, services, and concepts than the limited GIS world of 1992” (Goodchild 2010, 55). In this light, the statistical tool box proposed by Anselin and Getis is in many respects a redundant notion, for many software systems are now hybridized and enable sophisticated spatial data analysis to be performed. Significantly, however, Anselin and Getis proposed a dynamic and iterative approach to data analysis in the form of ESDA that, by drawing on EDA and the spatial data management and processing power of GIS, facilitated a tighter interaction between the user and spatial data analysis in a highly interactive and reflexive analytical and graphical environment. Central to shaping EDA and its spatial extension, ESDA, was the pioneering work of John Tukey (1977).

2 Exploratory Data Analysis

Exploratory Spatial Data Analysis advances Tukey’s (1977) seminal work on EDA through the tight coupling of geographical space to traditional EDA approaches. While this antecedence to ESDA is often recognized and acknowledged, the unique contribution of EDA to data analysis as espoused by Tukey is sometimes lost in the flurry to examine the spatial dimensions of ESDA. EDA is a critical starting point for research analysis, and there is a tendency to skip this exploratory step in the jump to confirmatory and inferential statistics. Understanding Tukey’s work can be valuable to regional scientists, for EDA represents both a philosophical and a methodological approach to data analysis.

EDA stands in some contrast to confirmatory inferential statistics in its emphasis on hypothesis generation rather than on hypothesis testing and confirmation. Tukey’s work paved the way for an alternative, yet in many ways complementary, approach to inferential statistical data analysis. The ideas of Tukey concerning EDA have been pursued and promoted by several authors who provide excellent insight into the essential message of Tukey and his nuanced approach to data analysis through EDA (Chatfield 1985, 1986; Cox and Jones 1981; Hartwig 1979; Hartwig and Dearing 1983; Hoaglin et al. 1983, 1985, 1991; Mosteller 1985; Sibley 1988). At its core, EDA focuses on exploring the properties of data and on using these findings to raise questions, pursue ideas, and generate hypotheses that can subsequently be tested through confirmatory data analysis. Tukey (1977, 1) claimed that, “Exploratory data analysis is detective work…numerical detective work…or graphical detective work…that requires both tools and understanding” (italics added). Tukey questioned the ability of inferential statistics to uncover ideas and hypotheses worthy of further investigation, arguing that hypothesis testing alone often ended in a dead end and provided little guidance as to the directions in which a study should proceed. Ideas, Tukey suggested, came from data exploration more often than from “lightning strokes”: “Finding the question is often more important than finding the answer” (Tukey 1980, 23–24). Indeed, to extend the premise still further, Tukey argued that, “An approximate answer to the right problem is worth a good deal more than an exact answer to the wrong question, which can always be made precise” (Tukey 1962, 13).

Much of Tukey’s work in EDA represents a critique of traditional confirmatory inferential statistics, a resistance to the Neyman-Pearson approach to confirmatory analysis and to its seeming unwillingness to examine the data prior to pursuing inferential statistical analysis (Fernholz and Morgenthaler 2000, 84). Tukey expressed concern from the outset about the ‘straight-line paradigm’ of confirmatory statistics that seemed to proceed linearly from question, to design, to data collection, to data analysis, and then to answer. One of his primary concerns was that this sequential process neglected how the questions are generated in the first instance. Furthermore, he asked, how could the research design be guided, the data collection monitored, or the analysis overseen to avoid inappropriate use of statistical models, if not by exploring the data before, during, and after analysis (Tukey 1980, 23)? To pursue confirmatory analysis, he argued, requires substantial exploratory work coupled with quasi-theoretical insight. Tukey suggested reorganizing the early stage of the straight-line paradigm such that a study proceeded from an idea, to an iterative combination of question and analytical (re)design, and thence to data collection, analysis, and outcome (ibid.). In this approach, the formulation of the ideas and questions is critical, yet as Tukey argued, such questions are rarely ‘tidy’ but rather are inchoate and require extensive exploration of past data (ibid., 24). Tukey saw the essential need for EDA to assist in formulating the questions deserving of subsequent confirmation (ibid., 24). Tukey did not reject confirmatory data analysis in favor of EDA, for he argued that each on its own was insufficient: “To try to replace either by the other is madness. We need them both” (ibid., 23).
A circular paradigm thus emerges, rather than a linear process, whereby theory defines the problem and EDA provides a feedback loop between the analysis and theoretical formation allowing for subsequent inferential analyses to be pursued or modified in the light of such exploratory work. Thus, analysis and theoretical understanding are enmeshed and not separate stages of an investigation. In this way, EDA emphasizes a constant, but meaningful, return to the data ‘honeypot’ and, as Tukey remarked, torturing the data until it has revealed all and has no more to confess.

Tukey argued against EDA being seen as comprising wholly descriptive statistics; rather, EDA was an “attitude” and a “flexibility”, supported by visual representations and “some helpful techniques” (ibid., 25). Tukey’s work in EDA can be seen as providing two primary themes for data analysis. First, he presented a practical philosophy as to how to proceed systematically through a data analysis and especially how to begin that process (Good 1983; Tukey and Wilk 1970). In my experience of teaching ESDA, this practical approach, which can be brought to bear on almost any data analysis, has been of greatest value to the many students who balk at where to begin and how to proceed through the data analytical process. The acid test, of course, is to present students with a data set that is known to them and to watch the barrage of inappropriate inferential statistics thrown at the data, which invariably fails to generate much understanding or substance from the analysis. EDA more closely replicates the process followed by experienced researchers when analyzing a dataset and provides a more practical path through the data analysis process than can be gained from any rigid adherence to standard statistical textbooks, in geography or otherwise. Despite this valuable practical philosophy, however, Tukey argued against EDA being seen as a kind of theory of data analysis (Fernholz and Morgenthaler 2000, 84).

Second, EDA places considerable emphasis on techniques that are both robust and resistant (Besag 1981; Mosteller 1985; Mosteller and Tukey 1977; Velleman and Hoaglin 1981). Tukey suggested that invariably little is known about the data to which we apply statistical models and, thus, there is a need to explore the data using techniques that minimize prior assumptions about the data and the model (assumptions that, he suggested, are often violated in practice) and that allow exploration of the data to guide the choice of appropriate questions and analysis. His focus on nonparametric statistics spurred the development of innovative techniques that were resistant to extraordinary data values that could unduly influence the results of an analysis, and that were robust in lessening reliance on assumptions about the data distribution, being essentially distribution free. Thus, the median, interquartile range, and percentiles are preferred over the mean and standard deviation because they are more resistant to extreme values and outliers. Creative techniques for univariate, bivariate, and hypervariate EDA such as boxplots, stem-and-leaf diagrams, q-q plots and Tukey mean-difference plots, parallel coordinate plots, lowess curves and local regression, multi-dot displays, compound filter smoothers of running medians for resistant time series analysis, resistant linear regression, scatterplot matrices, and conditional plots provide a diverse mix of robust and resistant EDA techniques that complement more ‘fragile’ and less resistant measures (Cleveland 1993; Velleman and Hoaglin 1981; Tukey 1977).
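The contrast between resistant and ‘fragile’ measures is easy to demonstrate. The short sketch below (plain Python standard library, with illustrative data of my own) shows how a single extreme value shifts the mean substantially while leaving the median and interquartile range largely untouched:

```python
import statistics

def iqr(values):
    """Interquartile range computed with the standard library's quantiles()."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1

clean = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9]
tainted = clean + [100]          # one wild observation added

# The fragile measure moves substantially...
print(statistics.mean(clean), statistics.mean(tainted))      # 5.2 vs ~13.8
# ...while the resistant measures barely register the outlier.
print(statistics.median(clean), statistics.median(tainted))  # 5.0 vs 5
print(iqr(clean), iqr(tainted))
```

This is precisely the sense in which EDA’s preferred summaries are ‘resistant’: one anomalous observation more than doubles the mean but leaves the median unchanged.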

A further characteristic of EDA is its focus on the ‘five number’ summary statistics of minimum, lower quartile, median, upper quartile, and maximum. This focus on the shape, spread, and central tendency of a distribution and on identifying and examining anomalies, outliers, trends, patterns, and residuals is central to EDA. EDA uses techniques that resist reductionism to summary statistics and instead try to keep the original data present at all times. Thus, stem-and-leaf diagrams are preferred over histograms, whose bins ‘hide’ the original data values. EDA places heavy emphasis on descriptive statistics, and it is here that it battles the perception that EDA and its statistics are somewhat obvious and trivial, and that inferential statistics and progressively more abstract statistical models represent greater legitimacy and intellectual value. To claim a focus on a data distribution curve, for example, may at first sight seem basic, yet the personal story of the eminent paleontologist Stephen Jay Gould represents a powerful example of the importance of just one of these descriptive measures. In ‘The Median Isn’t the Message’, Gould (1985) recounts being diagnosed with abdominal mesothelioma and being told that the median lifespan for people with this disease was 8 months. But Gould’s research fascination with variation and his training as a scientist led him to determine that the distribution curve was positively skewed and that, for a number of reasons, including good health care, early detection of the cancer, and no other health problems, he could place himself well into the long tail of the distribution. Indeed Gould survived a further 20 years and died of a different cancer. In a prefatory note Steve Dunn calls Gould’s article “the wisest, most humane thing ever written about cancer and statistics” (Dunn 2002).
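The stem-and-leaf display illustrates this preference for keeping the original data present. A minimal sketch (plain Python, with illustrative data; stems taken as the tens digits, leaves as the units) might look like:

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Build a stem-and-leaf display: stems are the tens digits, leaves the units."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return [
        f"{stem:2d} | " + " ".join(str(leaf) for leaf in stems.get(stem, []))
        for stem in range(min(stems), max(stems) + 1)
    ]

data = [12, 15, 21, 23, 23, 34, 38, 41]
for line in stem_and_leaf(data):
    print(line)
#  1 | 2 5
#  2 | 1 3 3
#  3 | 4 8
#  4 | 1
```

Unlike a histogram, every original observation is recoverable from the display, while the overall shape of the distribution remains visible at a glance.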

Thus, EDA resists the allure of the ‘magic number syndrome’ whereby complex distributions and patterns are reduced to summary numerical form that potentially hides the real pattern or complexity of the data. For this and other reasons, there is a heavy reliance in EDA on graphical display. As Tukey (1977, vi) contended, “The greatest value of a picture is when it forces us to notice what we never expected to see”. The visualization work of Cleveland (1993) and Tufte (1983, 1990) has added considerably to the emphasis on graphical representation in data analysis and to the suite of techniques available in EDA. This focus on exploratory techniques and the lessened reliance of EDA on preconceptions and assumptions about data stand in contrast to confirmatory statistics, which seeks to make broad conclusions and generalizations about a population based on the inferences drawn from the relationships found in a random sample of that population. The focus of inferential statistics on a priori hypothesis testing and probabilistic models and the derivation of estimates and confidence levels points to the need for descriptive statistics of the data as a preliminary step before a statistical model is applied or inferences are generalized to a larger population. EDA is particularly suited to the creative exploration of data and to generating questions and hypotheses, even though it often does not provide definitive answers. As Tukey indicated, EDA is not the whole story: of 1000 books on statistics in the 1970s, he observed, 999 would have been confirmatory. Arguably, that assessment is not much different today, except that space might now be added to EDA to create ESDA, with its additional focus on understanding the spatial dimensions of data and on hypothesis generation.

3 Spatial Extensions to Exploratory Data Analysis

If EDA is about using robust, resistant, and graphical techniques to identify, understand, and gain insight into the essential properties of data, then ESDA is an extension of that process that seeks to detect spatial patterns in the data, to formulate hypotheses based on the geography of the data, and to assess the appropriateness and assumptions of spatial models (Haining 2009). ESDA utilizes recent and dramatic advances in interactive desktop computer processing and computer graphics to create an exploratory analytical environment capable of linking EDA and spatial data analysis. ESDA provides a powerful idea and hypothesis generation platform with which to undertake complex spatial data analysis and integrates well with recent advances in local spatial statistical techniques, GIS, and geovisualization. The spatial and statistical modeling needs of regional science, coupled with ongoing advances in big data and spatial data mining, suggest that ESDA will be of growing importance in geographical analysis and regional science in the future. The growing availability of hybrid software systems capable of handling spatial data has contributed markedly to the ability to perform ESDA. S-Plus was an early software system equipped with a bridge to ESRI’s GIS system, though this was subsequently discontinued. Currently there are several analytics systems capable of performing ESDA, including Tableau (www.Tableau.com), Cartovis (cartovis.com), ESRI’s Geostatistical Analyst (http://www.esri.com/software/arcgis/extensions/geostatistical), GeoVista (http://www.geovista.psu.edu/), Weave (https://www.oicweave.org/), and the software environment R (https://www.r-project.org/). Perhaps best known within geography and regional science is GeoDa (https://geodacenter.asu.edu/).

In addition to the work of Tukey and other researchers in EDA, ESDA has been heavily influenced by the early work of Monmonier (1989) on the geographic brushing of scatterplot matrices, Cleveland’s work on data visualization, and Sibley’s (1988) spatial applications of EDA. In particular, ESDA owes much to the prescient work of Anselin (1993, 1999), who was not only early in identifying the potential for combining advances in GIS and spatial data management with spatial analysis and local spatial analysis but also in providing the means to do so through GeoDa. While the linkage between ESDA and GIS has been somewhat tenuous, in reality the tight coupling of space and data analysis evidenced by the development of the GeoDa software has made the link between GIS and EDA apparent and explicit. ESDA as envisaged by Anselin remains a subset of EDA rather than of GIS, and it focuses on exploring the distinguishing characteristics of spatial data through a suite of techniques that specifically address spatial autocorrelation and spatial heterogeneity.

ESDA usually contains a similar collection of EDA techniques capable of exploring, describing, and visualizing data, but with the additional capability of handling spatial data and mapping. Anselin’s particular focus has been to make spatial autocorrelation and spatial heterogeneity central to his ESDA software development (Anselin 2005). As Anselin writes, “ESDA is a collection of techniques to describe and visualize spatial distributions, identify atypical locations or spatial outliers, discover patterns of spatial association, clusters or hot spots, and suggest spatial regimes or other forms of spatial heterogeneity. Central to this conceptualization is the notion of spatial autocorrelation or spatial association, i.e., the phenomenon where locational similarity (observations in spatial proximity) is matched by value similarity (attribute correlation)” (Anselin 1999, 79–80). Thus, in addition to many of the robust and resistant techniques found in EDA, as outlined above, in ESDA the analyses are tightly coupled with spatial data and mapping. This tight coupling of spatial and attribute data occurs through brushing and linking of interconnected multiple dynamic window panels. Brushing and linking provides a powerful exploratory capability not just between tables and graphics, but with maps. Currently GeoDa and similar systems dynamically link multiple panes or windows containing various analytical techniques, including a map display (Fig. 11.1). Anselin provided an important step in ESDA in enabling EDA and spatial analysis to be tightly coupled. In addition to linking panels containing multiple simultaneous analyses, the ability to dynamically ‘brush’ individual or groups of data items in any panel or map display and to see the corresponding data points or relationships highlighted in the other panels creates a truly powerful exploratory tool.
These compelling visual and dynamic displays of multiple analyses are actively and dynamically linked to enable spatial patterns and spatial relationships to be examined, as well as to identify anomalies and outliers. Brushing not only allows for data points selected in one analytical panel to be automatically identified and displayed across all panels, but it is also possible to brush locations on a map to see the respective data displayed in the other panels and vice versa.

Fig. 11.1 GeoDa in the Virtual Reality CAVE, displaying multiple dynamically linked panels of analyses connected via brushing and linking to each other and to the map display on the floor

In addition to brushing and linking, GeoDa enables global spatial autocorrelation to be examined using Moran’s I, and local spatial autocorrelation to be examined using Local Indicators of Spatial Autocorrelation, which identify the specific location and magnitude of spatial autocorrelation (Anselin 1993). In instances where spatial patterns can be discerned, it is reasonable to assume that the spatial data are related and not independent, and tests for spatial autocorrelation using a spatial weights matrix based on locational contiguity can be applied to detect positive spatial autocorrelation and similarity between adjacent spatial units, or negative spatial autocorrelation and dissimilar patterns. A focus on spatial outliers reinforces the exploratory work of Tukey: anomalies are not to be ignored but embraced for the insights they provide. In tandem with Geographically Weighted Regression (Fotheringham et al. 2002), which identifies the occurrence of spatial non-stationarity and allows relationships to vary over space, the move toward local spatial statistics lends itself well to ESDA. The empirical Bayesian kriging of ESRI’s Geostatistical Analyst employs the semi-variogram to identify directional bias in correlations between sample points and, using the spatial covariance between data points, adjusts the weights of contributing sample points to optimize model interpolators for spatially continuous fields. Thus, within ESDA the concepts of distance, adjacency, interaction, and neighborhood spatially enrich a field of statistics that was relatively insensitive to, and unsuitable for, geographical investigation before the inclusion of the spatial dimension. In moving beyond the sampling of data points independent of the characteristics of the data being interpolated, these local spatially adaptive weighting functions are progressively embedding Tobler’s (1970) first law of geography into contemporary spatial analysis, and within the spatially enabled ESDA in particular.
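The global statistic at the heart of this workflow can be sketched compactly. The fragment below (NumPy, with illustrative data; a minimal sketch, not the GeoDa implementation) computes Moran’s I for a small lattice using a binary rook-contiguity spatial weights matrix; a value well above the expectation of −1/(n−1) under spatial randomness signals positive spatial autocorrelation:

```python
import numpy as np

def rook_weights(nrows, ncols):
    """Binary spatial weights: w[i, j] = 1 if grid cells i and j share an edge."""
    n = nrows * ncols
    w = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            i = r * ncols + c
            for dr, dc in ((1, 0), (0, 1)):   # neighbor below and to the right
                rr, cc = r + dr, c + dc
                if rr < nrows and cc < ncols:
                    j = rr * ncols + cc
                    w[i, j] = w[j, i] = 1
    return w

def morans_i(x, w):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i**2."""
    z = x - x.mean()
    s0 = w.sum()
    return (len(x) / s0) * (z @ w @ z) / (z @ z)

# A sharply clustered pattern on a 4x4 grid: low values west, high values east.
x = np.array([0, 0, 10, 10] * 4, dtype=float)
w = rook_weights(4, 4)
print(morans_i(x, w))   # ≈ 0.667: strong positive spatial autocorrelation
```

By contrast, a checkerboard arrangement of the same grid yields I = −1, the signature of perfect negative spatial autocorrelation between adjacent units.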

4 Discussion

It is suggested here that EDA, and its spatial counterpart ESDA, provide a powerful, systematic, and intuitive approach to spatial data analysis and a necessary precursor to the use of inferential statistics. Despite the embeddedness of these exploratory techniques within ESDA, the extent to which the premises and approaches of Tukey’s EDA have been recognized and accepted within regional science as necessary and complementary steps in the spatial data analysis process is not clear. EDA is still often seen as ‘descriptive’ and a ‘warm-up exercise’ for the real statistical analysis using confirmatory techniques. This perception diminishes the real value of EDA in understanding the very nature of a data set. In particular, the neglected potential of ESDA to formulate ideas and hypotheses, for pursuit either within the ESDA environment or with confirmatory inferential statistics, represents a missed opportunity. ESDA does more than enhance the spatial analytical capabilities of GIS; it represents a powerful approach to gain insight into the heart of the data.

Part of the reason for not fully embracing EDA may be, as others have pointed out (Goodchild 2010; Haining 2009), that in the face of big data and the growing availability of spatial data from GIS, the preference of some is to seek patterns and anomalies automatically. This, of course, flies in the face of Tukey’s conception of and purpose for EDA. Barnes (2003), in his critique of American regional science, suggests that the decline of regional science could arguably have been avoided. Barnes contends of regional science that, “It is unreflective, and consequently inured to change, because of a commitment to a God’s eye view. It is so convinced of its own rightness, of its Archimedean position, that it remained aloof and invariant, rather than being sensitive to its changing local context”. The advent of ESDA may be one change that will resonate with regional science, for by drawing on inductive reasoning (and arguably deductive reasoning as well, through EDA’s circular paradigm) it provides a reflective and exploratory environment that is creative and open-ended. As Tukey would argue, “Exploratory data analysis is an attitude, a flexibility, NOT a bundle of techniques…” (Tukey 1980, 23).

In the coming decades, and fueled by a potential avalanche of spatially rich data repositories created from a combination of automatic data sensors and human data generators, regional science will be challenged not only by data storage, curation, search, and query issues, but by how meaningful spatial data analysis of big data will be performed. The profile of ESDA could increase as its philosophy, tools, and techniques are brought to bear on big data to gain an understanding of extremely large and complex spatial datasets. Statistical analyses and visualization technologies struggle with the sheer volume, velocity, variety, and, increasingly, veracity of these data assets (Gandomi and Haider 2015). The application of intelligent machine learning approaches replicates some of the early focus of spatial analysis in GIS on automatically detecting patterns in complex data. Amidst assertions that big data will spell the end of theory, a major challenge posed by big data is that little is known about the underlying empirical micro-processes that lead to the emergence of the typical network characteristics of big data. And yet, such scenarios continue to raise the question that Tukey laid out nearly four decades ago: how are meaningful questions and hypotheses to be formed without an intensive exploration of the data?

Searching for plausible hypotheses, especially where the spatial pattern is not common knowledge, is problematic. Shekhar and Chawla (2003) proposed the use of interactive exploratory analysis to bring together a number of analytical panels that closely mirror an ESDA approach. Spatial data mining, they suggest, differs from spatial data analysis in its use of techniques derived from spatial statistics, spatial analysis, machine learning, and databases. The output from an iterative spatial data mining process, Shekhar and Chawla suggest, is typically a hypothesis (ibid., 237). One way they suggest to view data mining is as a filter step that occurs before the application of a rigorous statistical tool: “The role of the filter step is to literally plow through reams of data and generate some potentially interesting hypothesis which can then be verified using statistics” (ibid., 240). Thus, a key part of spatial data mining of big data is to comb through large databases in order to identify information that is relevant to building actionable models. As regional science confronts ever larger and more complex spatial databases, these exploratory techniques may take on greater importance in positioning the science and the research questions to be pursued.

Soon after the publication of Tukey’s seminal work, Cox and Jones (1981, 142) made a plea that “it is to be hoped that quantitative geography…will be less afflicted than in the past by a craving for the semblance of elegance, exactness, and rigour exuded by inferential ideas, and that geographers will show more willingness to engage in uninhibited exploration of their data, guided but not dominated by the procedures devised by statisticians”. Ten years later, following an NCGIA specialist meeting, Fotheringham (1992, 1676) reported that there might be instances “in certain circumstances” (italics in the original) where exploratory spatial data techniques within GIS might be appropriate. These circumstances appear to include spatial windowing to analyze data on the fly as the window is moved around a set of locations, detecting spatial outliers, disaggregating statistics spatially, and visualizing spatial data. In the wake of the GIS revolution, the growing abundance of digital spatial data, the era of big data, the rise of data mining, and the availability of ever more powerful computing and graphical visualization resources and hybrid software solutions, such hopes for ESDA may be closer to reality now than they were three or more decades ago.