1 Introduction

Distributed computing frameworks allow users to parallelize work across many cores to significantly reduce data access and processing time. By using data replication and storage across a cluster, these frameworks provide a powerful and scalable solution for handling massive and complex datasets. While the benefits of these frameworks are numerous, there are also inherent limitations that make them challenging to utilize when working with geospatial data. This paper will discuss these limitations and describe how we implemented efficient and effective solutions to these challenges within the open source framework GeoWave (https://github.com/locationtech/geowave). At its core, GeoWave is a software library that connects the scalability of various distributed computing frameworks and key-value stores with modern geospatial software to store, retrieve, and analyze massive geospatial datasets. GeoWave also contains intuitive command line tools, web services, and package installers. These features are designed to help minimize the learning curve for geospatial developers looking to utilize distributed computing, and for Big Data and distributed computing users looking to harness the power of Geographic Information Systems (GIS). The challenges we discuss in this paper are:

  • Storing multidimensional data in a single dimensional key-value store leading to locality degradation

  • Limiting the effectiveness of a distributed system due to downstream bottlenecks

  • Working with multiple data types in a single table

  • Creating GIS tools that work with multiple distributed key-value stores.

The greatest challenge in integrating geospatial systems into distributed clusters is the inability to preserve locality in multidimensional data types. Often when working with geospatial data, an analyst or developer does not want or need to use the entirety of a multi-billion point dataset for a spatially or temporally localized use case. It is beneficial to search only for qualified criteria within a bounding box around a city instead of searching a table that includes data of an entire country. Likewise, consider a user who wants to perform a k-means clustering analysis of data within a single city during the last month of a global, multi-year dataset. This analysis is considerably more efficient if the data is stored in a common, indexed location within the datastore. Unfortunately, distributed frameworks often store data in single dimensional keys that are then stored in the cluster using random hash functions [6, 14]. This feature, common to most distributed key-value stores, normally makes the previously mentioned use cases extremely challenging at scale. Naively, the entire dataset will need to be searched regardless of whether or not each data point is applicable for the intended use. Since geospatial data is naturally multidimensional, storing it primarily in a single dimension necessitates a tradeoff in the locality of every other dimension.

We solve this challenge and preserve locality in the GeoWave framework by using Space Filling Curves (SFCs) to decompose our bounding hyperrectangles into a series of single dimensional ranges. An SFC is a continuous, surjective mapping from ℝ to ℝ^d, where d is the number of dimensions being filled [9]. The complexities of the decomposition process are further discussed in Sect. 3.1 and related work specific to SFCs is part of Sect. 2.

A properly indexed distributed cluster can reduce the query and analytic times by multiple orders of magnitude, but these efficiency gains may still be masked by inefficiencies further downstream. One major bottleneck will be reached when attempting to render data in the context of a typical Web Map Service (WMS) request. Even a powerful rendering engine will cause serious delays when attempting to render billions of data points on a map. To display a WMS layer, a traditional rendering engine will first need to query and download all of the available data from the datastore. For smaller datasets this may not be a significant issue; however this will quickly become an unacceptable performance bottleneck when querying the massive datasets GeoWave was designed to work with. The fundamental issue here is the need to process all the data in order to visualize it. In this case that requires reading the data off of disk, but the same issue and scaling would take place if the data was stored in memory (system, GPU texture, etc.) – the constant factors and costs would simply change. In most cases this visualization of data ceases to convey additional information past a certain saturation point. A single pixel can only represent a finite amount of information at one time, so a high-level view of a WMS layer may have pixels representing thousands of individual data points. In this situation, there is a significant resource allocation directed at retrieving data that an end user will never actually see. We are able to make this process much more efficient by using spatial subsampling to identify and retrieve only the amount of data that will be visible in the layer. At a high level, GeoWave transforms the pixel space on the map into the context of the underlying SFC that is used to store the data. GeoWave then requests only the first data point that would be associated with that pixel. An in-depth description of this process is given in Sect. 3.2 of this paper. This approach turns what was a choke point into an interactive and responsive experience, and can be enabled by adding a simple pre-defined map stylization rule for a layer.

With continued advancements in data science practices it is often the case that a single type or source of data is not enough to answer a question or prove a hypothesis. The amount and variety of available data is growing at an ever-increasing rate. This should be to the benefit of an analyst. However, this data is often stored in different data structures and tables, adding a further level of complexity to its analysis. It may be beneficial to have these heterogeneous datasets in a commonly accessible location. Furthermore, for processing efficiency, it can be beneficial to have them indexed together and, in fact, commingled in the same tables to prevent unnecessary client-server roundtrips when correlating these hybrid datasets. Mixing various data types may seem trivial, but in spatiotemporal use cases careful attention must be paid to storing the attributes of the data that contribute to the SFC index in a common format and location. This will be discussed in more detail, but of particular note is that the SFC is intentionally an overly inclusive representation of the spatiotemporal bounds of the data. Careful attention is paid to eliminate false negatives, but false positives are possible using the SFC alone, and it is only by explicitly reading and testing the actual attribute values that all false positives can be eliminated. The software design challenge is doing a fine-grained spatiotemporal evaluation on the server side with many different types of data within the scope of the same scan. Regardless of the individual data type, an agnostic common model is used to store the indexed data. This common index model, along with the ability to add extended fields with a serialized adapter, allows the software to seamlessly interact with all varieties of data in a spatiotemporal context, as well as parse the extended information for each individual type. We will expand on this process and its architecture in Sect. 3.3.

As is the case with many modern technologies, there is not a single standard distributed computing framework that is used ubiquitously. While many function similarly, there are numerous options, and in practice there are generally performance tradeoffs that can make a certain framework better for a given task. A particular development team’s experience and preference is another contributing factor – the significant differences in implementation details between these technologies mean that a sizable investment in time and resources may be necessary up front. A development team will need to become familiar with a new distributed framework before they can begin to benefit from a library designed to work solely with that framework. Therefore, GeoWave has been designed to be key-value store agnostic. Functionality and performance parity between supported key-value stores is achieved by drawing on as many commonalities as possible to keep the functionality generic and abstract, while paying careful attention to microbenchmarks in the cases where tradeoffs and underlying expected behavior differ. While the full breadth of this task is beyond the scope of this paper, some relevant examples are discussed in Sects. 3.4 and 4.3.

2 Related Work

The concept of multidimensional indexing, or hashing, has been well researched and there is a significant amount of related work in this field. Efficient multidimensional hashing algorithms, as well as the limits to n-dimensional locality preservation, have been proposed by authors such as Indyk [11]. More recently, projects like Spatial Hadoop [7] have been developed to use grid files and R-trees to index two dimensional data into the Hadoop ecosystem. We have chosen to use SFCs for the multidimensional indexing component within our GeoWave project. SFCs have been discussed and studied frequently for many years. The research that has most directly impacted our design is from the work of Haverkort [10] in measuring the effectiveness of various SFCs and the introduction of the Compact Hilbert SFC by Hamilton [9].

In the Haverkort paper [10], a series of measures (Worst-case Dilation/Locality [WL], Worst-case Bounding Box Area Ratio [WBA], and Average Total Bounding Box Area [ABA]) were devised to test the effectiveness of various types of SFCs. The results from these tests are summarized in Table 1.

Table 1. Summary of results from Haverkort comparison [10]

The Z-Order curve has been successfully used for multidimensional indexing in products like MD-HBase [13], but was by far the worst in these experiments. The βΩ and Hilbert curves performed extremely well across all of these locality-based measures, but there are drawbacks to them as well. A major limitation of these SFCs is that the cardinality, or grid size, must be the same in each dimension. This can pose a significant issue in the ability to use the curve in spatiotemporal applications as well as when we consider other common dimensions such as elevation. These use cases can benefit from dissimilar cardinalities among dimensions to enable trading off dimensional precision when common access patterns are likewise unbalanced in dimensional precision. Hamilton [9] solves this issue by defining a new version of the Hilbert SFC called the Compact Hilbert SFC. The Compact Hilbert SFC maintains all of the locality preservation advantages of a standard Hilbert SFC while allowing for varying sized cardinalities among dimensions. Furthermore, we were able to utilize an open source Compact Hilbert SFC implementation called Uzaygezen [15]. Contributions to Uzaygezen were made by members of Google’s Dublin Engineering Team and the word itself is Turkish for “space wanderer.”

3 Contributions

3.1 Locality Preservation of Multi-dimensional Values

In the introduction, we briefly discussed that the GeoWave framework uses SFCs to decompose bounding hyperrectangles into ranges in a single dimension. Here we will go into further detail about how this is accomplished and how it benefits us in terms of locality preservation. For simplicity, we will use a two-dimensional example with a uniform cardinality to explain this process; however, GeoWave is capable of performing this process over any number of dimensions with mixed cardinalities.

GeoWave uses recursive decomposition to build what are effectively quadtrees when applied in two dimensions. A quadtree is a data index structure in which two-dimensional data is recursively broken down into four children per region [12]. These trees are built over a gridded index in which each elbow (discrete point) in the Hilbert SFC maps to a cell in the grid. These cells provide a numbering system that gives us the ability to refer to each individual grid location, or bucket, with a single value. The curves are composable and nested, giving us the ability to store curves and grids at various levels of precision, where the number of bits in the Hilbert SFC determines the grid resolution. Although our “quadtrees” lack the traditional direct parent-child relationships between progressive levels, this tiered method in which we store each level has a similar effect (Fig. 1).

Fig. 1. Visualization of a tiered, hierarchical SFC storage system

Now that we have a tiered, hierarchical, multi-resolution grid instance, we can use it to decompose a single query into a highly optimized set of query ranges. As a simple example, we step through the query process with a two dimensional fourth order Hilbert SFC that corresponds to a 16 by 16 cell grid (2^4 = 16). Laid out over a map of the world that grid would have a resolution of 11.25° latitude and 22.5° longitude per cell. Once a query bounding box has been selected, the first step is to find the minimally continuous quadrant bounding the selection window. In this example, a user has selected an area bounded by the range of (3,13) -> (6,14) on the 16 by 16 grid. The Hilbert value at each of the four corners (a sketch of one such cell-to-curve mapping follows the list) would be:

  • (3,13) = 92

  • (3,14) = 91

  • (6,13) = 109

  • (6,14) = 104
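
For readers who want to reproduce these corner values, the sketch below shows the widely published iterative algorithm for mapping a cell (x, y) on an n by n grid to its Hilbert index. GeoWave itself uses the Uzaygezen Compact Hilbert implementation; this standalone version is only illustrative, though with this orientation convention it reproduces the corner values above (for example, xy2d(16, 3, 13) returns 92).

```java
// Illustrative only: map a grid cell (x, y) on an n x n grid (n a power of two) to its
// Hilbert curve index using the standard iterative quadrant-rotation algorithm.
static long xy2d(long n, long x, long y) {
    long d = 0;
    for (long s = n / 2; s > 0; s /= 2) {
        long rx = (x & s) > 0 ? 1 : 0;     // which half of the grid in x at this level
        long ry = (y & s) > 0 ? 1 : 0;     // which half of the grid in y at this level
        d += s * s * ((3 * rx) ^ ry);      // each quadrant contributes s*s cells to the index
        if (ry == 0) {                     // rotate/flip so the sub-curve lines up at the next level
            if (rx == 1) {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            long t = x;
            x = y;
            y = t;
        }
    }
    return d;
}
```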

We then take the minimum and maximum Hilbert values, as they are the most likely to be in separate quadtree regions, and perform an exclusive or (XOR) operation to find the first differing region. To be precise, we treat the Hilbert values as base 2 and use bitwise operations to define x as the number of significant bits in the XOR result (padding with a leading 0 if that count is odd). We find the starting and ending intervals of the minimally continuous quadrant bounding the selection window by applying a 1 followed by x zeros as the lower bound and a 1 followed by x ones as the upper bound. In the example above, we have 91 and 109 as our minimum and maximum Hilbert values. It follows that (01011011 XOR 01101101) = 00110110, which has six significant bits; therefore, the bounds for our minimally continuous quadrant are 01000000 = 64 and 01111111 = 127.
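
A minimal sketch of this quadrant-bounding step is shown below, assuming non-negative Hilbert values that fit in a Java long and a two-dimensional curve (so the differing bit count is padded to an even number). Keeping the shared high bits and clearing or setting the differing low bits is equivalent to the "1 followed by zeros/ones" pattern described above.

```java
// Sketch: bound the minimally continuous quadrant containing two Hilbert values.
static long[] minimalQuadrantBounds(long minHilbert, long maxHilbert) {
    long xor = minHilbert ^ maxHilbert;
    int bits = 64 - Long.numberOfLeadingZeros(xor);   // number of differing low bits
    if (bits % 2 != 0) {
        bits++;                                       // pad to an even count for a 2D curve
    }
    long lower = (minHilbert >>> bits) << bits;       // shared high bits, low bits cleared
    long upper = lower | ((1L << bits) - 1);          // shared high bits, low bits set
    return new long[] {lower, upper};                 // e.g. {64, 127} for inputs 91 and 109
}
```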

The previous step takes us to the deepest node of the quadtree that fully contains the queried range. Now, we recursively decompose each region into four sub-regions and check whether each sub-region is fully covered by the original query window. In the example shown in Figs. 2 and 3, the darkest shaded region represents the query range. We first break down the minimally continuous quadrant into four sub-regions, then break down each of those into four sub-regions of their own. At this point, we only continue decomposing regions that partially overlap the original search area. Sub-regions that fall entirely within the bounding box are left at their current tier, and sub-regions that do not overlap any portion of the bounding box are skipped. We continue to decompose the area until we have broken the query bounding box down into a series of ranges along our Hilbert curve. This knowledge allows us to search only the keys within the original bounding box despite discontinuities along the curve.
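
The recursion can be sketched as follows over a 2^k by 2^k cell grid, reusing the xy2d mapping from the earlier sketch; a quadtree-aligned block that is fully covered by the query is emitted as one contiguous curve range. This is an illustration of the approach, not GeoWave's implementation, and the query window is given in inclusive cell coordinates.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: recursively decompose the block at (x0, y0) with side length `size` against the
// query box [qxMin..qxMax] x [qyMin..qyMax], appending [start, end] Hilbert ranges.
static void decompose(long n, long x0, long y0, long size,
                      long qxMin, long qyMin, long qxMax, long qyMax, List<long[]> ranges) {
    long x1 = x0 + size - 1;
    long y1 = y0 + size - 1;
    if (x1 < qxMin || x0 > qxMax || y1 < qyMin || y0 > qyMax) {
        return;                                      // no overlap with the query box: skip
    }
    boolean contained = x0 >= qxMin && x1 <= qxMax && y0 >= qyMin && y1 <= qyMax;
    if (contained || size == 1) {
        long cells = size * size;                    // aligned block = one contiguous curve range
        long start = (xy2d(n, x0, y0) / cells) * cells;
        ranges.add(new long[] {start, start + cells - 1});
        return;
    }
    long half = size / 2;                            // partial overlap: recurse into four children
    decompose(n, x0, y0, half, qxMin, qyMin, qxMax, qyMax, ranges);
    decompose(n, x0 + half, y0, half, qxMin, qyMin, qxMax, qyMax, ranges);
    decompose(n, x0, y0 + half, half, qxMin, qyMin, qxMax, qyMax, ranges);
    decompose(n, x0 + half, y0 + half, half, qxMin, qyMin, qxMax, qyMax, ranges);
}
```

Starting the recursion at the minimally continuous quadrant found above yields the kind of range set visualized in Figs. 2 and 3.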

Fig. 2. A query is first decomposed into sub-regions

Fig. 3. A query decomposed into a series of ranges

This example shows how GeoWave uses SFCs in bounded dimensions like latitude and longitude; however, unbounded dimensions like time, elevation, or perhaps velocity add another wrinkle to the process. In order to normalize real world values to fit a space filling curve, the values must be bounded. We solve this issue by binning using a defined periodicity. A bin simply represents a bounded period for each dimension. The default periodicity for time in the GeoWave framework is a year, but other periods can easily be set for use cases where they would prove more efficient. The efficiencies gained around choosing a periodicity are outside the scope of this paper, but the short conclusion one can draw is that it is beneficial to choose a periodicity based on the precision of your queries. At a coarse-grained level it is significantly suboptimal to have use cases in which a single query intersects several individual periods, and at a fine-grained level it is suboptimal to have queries that sub-select a significant amount of data falling within the precision of a single cell (Fig. 4).

Fig. 4. Bins for a one year time periodicity

Each bin is a hyperrectangle representing ranges of data (based on the set periodicity) and labeled by points on a Hilbert SFC. Bounded dimensions such as latitude or longitude are assumed to have only a single bin. However, if multiple unbounded dimensions are used, combinations of the bins of all of the unbounded dimensions will define each individual bounded SFC. This method may cause duplicates to be returned in scans of entries whose extents cross periods, necessitating their removal during the fine-grained filtering process – preferably performed on the data node for implementations that allow custom server-side filtering (Fig. 5).
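
As a concrete illustration of binning by the default yearly periodicity, the sketch below derives a bin identifier and a normalized position within the bin for a timestamp; the normalized value is what would feed the SFC for the time dimension. The method names and UTC handling are illustrative assumptions, not GeoWave's API.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Sketch: bin an unbounded time dimension with a yearly periodicity.
static long yearBin(Instant time) {
    return ZonedDateTime.ofInstant(time, ZoneOffset.UTC).getYear();   // one bin per calendar year
}

// Sketch: position of the timestamp within its bin, normalized to [0, 1) for the SFC.
static double normalizedWithinYear(Instant time) {
    ZonedDateTime t = ZonedDateTime.ofInstant(time, ZoneOffset.UTC);
    Instant start = t.withDayOfYear(1).toLocalDate().atStartOfDay(ZoneOffset.UTC).toInstant();
    Instant end = start.atZone(ZoneOffset.UTC).plusYears(1).toInstant();
    return (double) (time.toEpochMilli() - start.toEpochMilli())
            / (end.toEpochMilli() - start.toEpochMilli());
}
```

An entry whose time extent spans two calendar years would be written under both bins, which is the source of the duplicates mentioned above.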

Fig. 5. Composition of multiple unbounded dimensions

3.2 Spatial Subsampling for Map Rendering

Earlier in this paper, we discussed the challenge of rendering massive datasets and gave a brief description of how we tackle the issue. In this section, we will go into further detail about the concepts we use in the GeoWave framework to solve these challenges. In order to visualize a dataset without having to process all of it, we must first break down tile requests into a request or range per pixel. Figure 6 shows this process for a subset of the pixels in a 256 by 256 pixel tile. Here you can see a representation of the correspondence of our index grid to our pixel grid.

Fig. 6. Visualization of conversion from tile request to pixel request

We could define a query per pixel at this point, but we would still need to load all of the data. This bottleneck has not been improved yet – we are just returning the same amount of data in pixel-sized chunks instead of tiles. The final piece of the puzzle is to short-circuit, or skip, the read of other values within the precision of a pixel once any value matching all filter criteria is returned. The logic of this server-side row skipping is visualized in Fig. 7, and is implemented differently based on the datastore being used. As an example, we can extend the SkippingIterator class in Accumulo to scan the start of a range until a value is found, allow this value to stream back to the client, and seek to the start of the next pixel (key range) instead of reading the next key-value pair in the current range. This same functionality is carried out in HBase by applying a custom filter that behaves similarly, acting as a skipping filter that runs and seeks within the scope of a server-side scan.
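
The skipping logic itself is straightforward, as the self-contained sketch below shows; it uses a NavigableMap as a stand-in for a sorted key-value store, whereas the real implementations perform the equivalent seek inside the server-side scan (Accumulo's SkippingIterator, a custom HBase filter). Keys here are SFC values and cellsPerPixel is an assumed, pre-computed number of curve cells per pixel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

// Sketch: return at most one entry per pixel-sized block of SFC cells within [rangeStart, rangeEnd].
static List<Map.Entry<Long, byte[]>> subsample(NavigableMap<Long, byte[]> store,
        long rangeStart, long rangeEnd, long cellsPerPixel) {
    List<Map.Entry<Long, byte[]>> results = new ArrayList<>();
    long cursor = rangeStart;
    while (cursor <= rangeEnd) {
        Map.Entry<Long, byte[]> hit = store.ceilingEntry(cursor);   // "seek" to the next stored key
        if (hit == null || hit.getKey() > rangeEnd) {
            break;                                                   // nothing left in the range
        }
        results.add(hit);                                            // keep one value for this pixel
        long pixelStart = (hit.getKey() / cellsPerPixel) * cellsPerPixel;
        cursor = pixelStart + cellsPerPixel;                         // skip the rest of that pixel
    }
    return results;
}
```

Because the seek happens on the data node in the real implementations, only one value per pixel ever crosses the network.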

Fig. 7. Visualization of server-side row skipping in the GeoWave framework

While seeking is less performant than sequential reading of data (this relationship is borne out in the testing results in Sects. 4.2 and 4.3), there comes a point where the sheer amount of data that has to be read far outweighs any performance penalty for random IO. In our testing this point occurs at the order of magnitude of millions of features but is heavily dependent on factors such as disk subsystem, IO caching, etc. The example above, while correct, makes certain assumptions that greatly simplify the situation and do not always hold in real world cases. In the provided example, the pixel grid very conveniently aligned with the index grid, which greatly simplified the logic involved in “skipping to the next pixel.” In the figure below, we demonstrate a situation where the index grid and the pixel grid do not align. In this case, we are not able to skip directly to the next pixel but instead must iterate through indexed “units.”

Again, because our Hilbert SFC in two dimensions behaves much like a quadtree, we exploit its nesting properties to try to match pixel size to grid size. Though it is not guaranteed to match exactly (as demonstrated in Fig. 8), we do have a guarantee that worst-case performance will be no worse than a factor of four over ideal. If it is more than a factor of four, we can simply move one level up our quadtree, treating the index grid size as units of 4. This does assume that the cardinality of your index is at least a power of two higher than the cardinality of your zoom level (pixels per dimension). When this is not true, the performance of the subsampling process reverts to standard sequential IO. Using a cardinality of 31 as in the software’s default spatial dimensionality definitions, this maximum effective subsampling resolution is less than 2 cm per pixel at the equator.
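
The tier selection amounts to picking the shallowest curve resolution whose cells are no larger than a pixel, as in the rough sketch below; the parameter names are illustrative and the cap corresponds to the index cardinality mentioned above.

```java
// Sketch: choose bits per dimension so that one curve cell is no larger than one pixel.
static int bitsPerDimension(double dimensionExtent, double unitsPerPixel, int maxBits) {
    int bits = 0;
    double cellSize = dimensionExtent;
    while (cellSize > unitsPerPixel && bits < maxBits) {
        cellSize /= 2;   // each additional bit per dimension halves the cell size (one quadtree level)
        bits++;
    }
    // With the default 31-bit spatial cardinality, the finest longitude cell is 360 / 2^31 degrees,
    // roughly 1.9 cm at the equator, which bounds the useful subsampling resolution.
    return bits;
}
```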

Fig. 8. Our index grid is not always perfectly aligned with the pixel grid

3.3 Managing Data Variety and Complexity

We previously discussed how GeoWave uses a common index model to allow the commingling of disparate data types in the same table. Here we will expand on this concept and discuss how it is implemented in the GeoWave framework.

Figure 9 describes the default structure of entries in the datastore. The index ID comes directly from the tiered SFC implementation. Each element contains its corresponding SFC tier, bin, and compact Hilbert value. We do not impose a requirement that data IDs are globally unique, but they should be unique for the individual adapter. In this way, an environment can be massively multi-tenant without imposing restrictions across individual usage. This pairing of “Adapter ID” and “Data ID” defines a unique identifier for a data element that also creates a unique row ID. This is necessary as the indexed dimensions could be the same or similar, mapping to the same SFC value despite being different entities. Moreover, it is a fundamental requirement in a multi-tenant distributed key-value store to form unique row IDs for unique entries. It also allows us to rapidly filter out potential duplicates and false positives within server-side filtering and ensure complete accuracy of queries. Adapter and Data IDs can vary in length, so we store their lengths as four byte integers to ensure proper readability. The number of duplicates is stored within the row ID as well to inform the de-duplication filter whether this element needs to be temporarily stored to ensure no duplicates are sent to the caller. The Adapter ID is duplicated and used as the column family, providing the mechanism for adapter-specific queries to fetch only the appropriate column families. Accumulo has a concept of locality groups that controls how column families are stored together, and by default we use a locality group for each individual adapter (data type), but this behavior can be easily disabled if optimizing performance across many data types is preferred.
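
A minimal sketch of composing such a row ID is shown below; the field order, integer widths, and duplicate-count encoding are assumptions for illustration and may differ from GeoWave's actual layout.

```java
import java.nio.ByteBuffer;

// Sketch: compose a row ID from the index ID, adapter ID, data ID, their lengths, and the
// number of duplicates, as described for the common model above.
static byte[] composeRowId(byte[] indexId, byte[] adapterId, byte[] dataId, int numDuplicates) {
    ByteBuffer buf = ByteBuffer.allocate(indexId.length + adapterId.length + dataId.length + 12);
    buf.put(indexId);               // SFC tier, bin, and compact Hilbert value
    buf.put(adapterId);             // identifies the data type sharing the table
    buf.put(dataId);                // unique per adapter, not necessarily globally unique
    buf.putInt(adapterId.length);   // lengths stored so variable-width IDs can be parsed back out
    buf.putInt(dataId.length);
    buf.putInt(numDuplicates);      // informs the server-side de-duplication filter
    return buf.array();
}
```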

Fig. 9. Accumulo implementation of the GeoWave common model

Regardless of the data type, we adapt this same persistence model to index it. This ensures that all GeoWave features will work the same way regardless of the data type. This is what allows the system to seamlessly store and retrieve many disparate data types within the same table.

We have already discussed that one of the benefits of this indexing method is the ability to perform fine-grained filtering of the data. In this example, we will explain why this is necessary. In most of our previous cases, the bounding box for the query perfectly overlapped a Hilbert value or “bucket.” In Fig. 10, that is not the case.

Fig. 10. Example of over-inclusion that will need to be filtered out in the fine-grained tuning

Because the bounding box overlaps a bucket, that bucket will still “match,” since the potential to hit data is there. In this case, it would match for the range of 6 through 9. But we now need to make a second pass and check each bucket value to see if it overlaps the bounding box. In this case, we use the buckets as a rough filter to quickly reduce the number of items we have to check. We then do a filter evaluation on the actual values at their native precision. In the case of complex geometric relationships, we use the popular Java Topology Suite (JTS) library to evaluate the filter. In this case, Point 1 and Point 2 both fall in the bucket range of 6 through 9 and are evaluated using the fine-grained filtering process (but notice that Point 3 is ignored as it falls outside the range). BBOX overlaps Point 1 = false, so it is rejected. BBOX overlaps Point 2 = true, so it is included. It is important to note that all of this filtering is performed server side within the implementation constraints of the datastore. This allows the software to take advantage of a distributed cluster and parallelize the filtering process, as well as reduce unnecessary network traffic.

As a result, the time necessary to filter out false positives is reduced. Of note, this fine-grained filtering can only be performed across many disparate data types within a single server-side scan because the common index model stores the indexed values at native precision.
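
The fine-grained pass in the example above can be sketched with JTS as follows; the candidate geometry is assumed to have already passed the coarse bucket-range filter and been decoded from the common index model.

```java
import org.locationtech.jts.geom.Envelope;
import org.locationtech.jts.geom.Geometry;
import org.locationtech.jts.geom.GeometryFactory;

// Sketch: test a candidate geometry against the query bounding box at native precision,
// rejecting the false positives admitted by the SFC bucket ranges.
static boolean passesFineFilter(Geometry candidate, Envelope queryBbox) {
    GeometryFactory factory = new GeometryFactory();
    Geometry bbox = factory.toGeometry(queryBbox);   // the query window as a JTS geometry
    return bbox.intersects(candidate);               // e.g. Point 1 -> false, Point 2 -> true
}
```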

3.4 Key-Value Store Parity

GeoWave has established the goal of parity in function and performance among a variety of key-value stores. GeoWave’s user-facing tools and developer-facing APIs are agnostic to the underlying datastore. The connection configuration is the only difference. A named store is configured once, via the command line tools, and that named entity can then be used interchangeably with any of the supported key-value stores by subsequent commands. We accomplish this by drawing on the commonalities of distributed key-value stores and microbenchmarking the differences.

One such commonality is that each of the supported stores can perform range scans more efficiently on a well-sorted range of keys than on a discontinuous set. However, the way in which these keys are sorted may align so closely with typical query use cases that load concentrates on a few nodes, a behavior referred to as hotspotting. Hotspotting is a situation where processing and data are isolated to specific resources. This is obvious in a single-tenant use case, where the only resources with activity are the ones within a single filter constraint, but it also occurs in highly multi-tenant environments due to current events and common trends in usage patterns. Each datastore has slightly different ways to more randomly distribute data. Apache Cassandra [3] and Amazon’s DynamoDB [1], for example, force this random distribution, at least under default and recommended configuration, by requiring a “hash” key that does not have any implication on sort order. Additionally, they allow a “range” key that will benefit from sort order. With Apache Accumulo [2], Apache HBase [4], and Google’s Cloud BigTable [5], there is a single row key. BigTable is the most different as it optimizes the distribution of data implicitly. Apache Accumulo and Apache HBase, in practice, often use a concept called pre-splitting. Pre-splitting a table enables the user to define key ranges that will immediately fall within partition boundaries. Using this methodology, a hash or random prefix that defines the partition can be associated with each entry, and a sort-ordered suffix can then be appended to it. In this way, randomization is achieved to the extent necessary to ensure proper resource utilization, and range scans can effectively be performed within each partition. A user or developer within the GeoWave framework is able to define a general purpose indexing approach that includes a partitioning strategy and a multi-dimensional numeric data strategy. GeoWave keeps the sort order portion contributed by an index strategy separate from the portion that doesn’t imply sort order. Within Accumulo and HBase, GeoWave can append the key appropriately. Within Cassandra and DynamoDB, it can use the appropriate column type.
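
For stores with a single row key, the prefix-plus-suffix idea can be sketched as below; the one-byte prefix width and the hash choice are illustrative assumptions rather than GeoWave's partitioning strategy.

```java
import java.util.Arrays;

// Sketch: prepend a hashed partition prefix (no sort meaning) to a sort-ordered key so that
// writes spread across pre-split partitions while SFC sort order is preserved within each.
static byte[] partitionedKey(byte[] sortOrderedKey, int numPartitions) {
    int partition = Math.floorMod(Arrays.hashCode(sortOrderedKey), numPartitions);
    byte[] key = new byte[sortOrderedKey.length + 1];
    key[0] = (byte) partition;                                           // partition prefix
    System.arraycopy(sortOrderedKey, 0, key, 1, sortOrderedKey.length);  // sort-ordered suffix
    return key;
}
```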

It cannot be expected that all behaviors within these datastores are similar. For example, DynamoDB and BigTable are externally hosted and the others are open source projects. We have already indicated how BigTable handles optimizing data distribution for the user, but DynamoDB also has interesting characteristics. For example, the payment model is very different – a user pays for provisioned throughput rather than defining a size for a cluster. This has direct implications on query optimizations within GeoWave. For example, if a user is throttled by read throughput, decomposing a hyperrectangle into the most fine-grained set of ranges seems like an appropriate tradeoff, whereas if the user is not limited by read throughput it may be more efficient to minimize ranges (with the understanding that each range implies a separate REST invocation to AWS). Moreover, it is a reasonable expectation for each datastore to exhibit different behaviors when using many ranges. We will go into more detail on the implications of too many or too few ranges within the context of each datastore in Sect. 4.3.

4 Experimental Evaluation

4.1 Locality Preservation Performance

While the benefits of effective mapping between multi-dimensional objects and the sort order of keys within a distributed key-value store may be apparent to some, these benefits are easily demonstrable within the GeoWave framework. The cluster utilized in this benchmark comprises one master node and 20 data nodes; each node is an m3.xlarge instance type running in Amazon Web Services (AWS). In the following benchmark, we ingest data that is inherently four-dimensional with values for a three-dimensional coordinate space in X, Y, and Z, as well as a temporal component T. Each data entry also contains 500 bytes of extended metadata that is retrieved. One billion randomly generated data elements are ingested into GeoWave using HBase as the datastore and four different indexing definitions that exercise the underlying compact Hilbert curve to varying degrees of dimensionality. In the following graph, randomly generated queries of intentionally varied sizes are executed against each of the dimensional definitions. The time from defining query constraints to returning the last data element is plotted on the Y-axis on a logarithmic scale.

In Fig. 11, we can easily see the benefit of locality preservation of multi-dimensional objects. It is also apparent that this benefit can often be most impactful on more selective queries. On the query returning zero results, for example, it is safe to assume that the single dimensional index filtered quite a bit out in the X dimension, but a GeoWave custom filter on the server side still has to evaluate what remains and filter out the false positives. While more effective than returning false positives to the client, it is considerably less effective than simply utilizing effective multi-dimensional representations for keys and ranges.

Fig. 11. Graph of response times for queries of four-dimensional data on varying index types (the source code to produce these results is publicly available at https://github.com/rfecher/dimensionality-experiments)

It is worth noting that for relatively tightly constrained queries the 3D index behaved nearly identically to the 4D index. This is understood to imply that with very tightly constrained queries there are very few, if any, entries that are additionally constrained by the fourth dimension. However, as the queries become less restrictive there is a significant benefit in applying the additional constraints along the fourth dimension to effectively and completely filter the key space.

4.2 Map Pixel-Based Spatial Subsampling Performance

To demonstrate the power and time savings of GeoWave spatial subsampling, we will record the amount of time (in seconds) that it takes to render various numbers of GPS locations contributed to the OpenStreetMap (OSM) project with and without subsampling. This OSM GPX dataset is a crowd-sourced set of GPS-derived trajectories primarily used to determine connectivity of road networks. This data is open source and can be downloaded from: http://planet.openstreetmap.org/. In this example, we set up a 10 node distributed computing cluster (one Master and nine Workers) using AWS Elastic Map Reduce (EMR). We chose to use Accumulo for our datastore, which was backed by non-provisioned-IOPS EBS volumes. Each node used an m3.2xlarge EC2 instance and the global set of GPX data used for this test was 21 GB compressed, 256 GB uncompressed, and 153 GB in HDFS as an Accumulo database file after ingest. The rendering process was handled by the popular open source software GeoServer. Table 2 shows the results of the comparison.

Table 2. Effects of spatial subsampling on time taken to render various numbers of features

Each value shown in Table 2 is the average of 20 runs with the standard deviation reported afterwards. The “Out of memory” errors were on the GeoServer instance. The WMS process memory allocation was increased to 4 GB from the GeoServer default of 64 MB, but that was still not sufficient to prevent the error.

The results shown in Table 2 are as expected, and the errors encountered further demonstrate the challenges of rendering this amount of data without techniques such as smart subsampling. When the dataset was quite small (e.g., hundreds of thousands of features), the spatial subsampling method was slower due to the impact of random seeks. As the number of features being rendered grew, however, the spatial subsampling method returned results dramatically faster, while the traditional render method quickly became untenable. Even at the largest feature count that we were still able to render successfully using the traditional method, the render took over six and a half minutes to finish.

4.3 Differences in Multi-range Scans Among Key-Value Stores

It is beyond the scope of this paper to comprehensively detail what a range scan implies on each datastore. But as a generalization, it is fair to say that an individual range scan within each datastore has a cost, and it is unfair to assume that they behave the same in this respect. This is the motivation for the following microbenchmark. As an example of the challenges faced in optimizing query range decomposition, suppose that a sparsely populated table is queried using a particular set of 5000 ranges that well-represent a query hyperrectangle. On this table a query may intersect 1–5 entries per range, but on a different, more densely populated, table this same set of ranges intersects thousands of entries per range. Considering the first table is sparsely populated in this hyperrectangle, it would additionally be fair to assume that combining similar ranges to end up with 10 ranges will introduce no, or very few, false positives. Given the cost of each range in the sparsely populated table, it is a considerable advantage to combine the ranges. However, combining the ranges down to only 10 to represent the hyperrectangle in the densely populated table causes a considerable loss in the specificity of the keys: the false positives produced far outweigh any cost of additional ranges. Some of this can be informed by the histograms of data that GeoWave stores as a statistic on each write, but this is an example of the complex tradeoffs involved that necessitate the following microbenchmarks.
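
The range-combining side of this tradeoff can be sketched as a simple coalescing pass, shown below; the maxGap threshold is an illustrative tuning knob, and the gaps absorbed by merging are exactly where false positives can enter.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: merge nearby [start, end] ranges (sorted by start) when the gap between them is small,
// trading per-range overhead for possible false positives inside the absorbed gaps.
static List<long[]> coalesce(List<long[]> sortedRanges, long maxGap) {
    List<long[]> merged = new ArrayList<>();
    long[] current = null;
    for (long[] range : sortedRanges) {
        if (current != null && range[0] - current[1] <= maxGap) {
            current[1] = Math.max(current[1], range[1]);   // absorb the gap and extend the range
        } else {
            current = new long[] {range[0], range[1]};     // start a new merged range
            merged.add(current);
        }
    }
    return merged;
}
```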

The impacts of multi-range scans are benchmarked on a cluster with one m3.xlarge master node and five m3.xlarge data nodes for Accumulo, HBase, and Cassandra. The client accessing DynamoDB is an m3.xlarge instance with sufficiently high read throughput that the scan is not being throttled by AWS. BigTable is not included in this particular benchmark in part because BigTable optimizes data placement over time to distribute reads equally across all nodes. This is a challenge to control in a benchmark. Furthermore, BigTable’s client API appears to be evolving with multi-range scan support in the near future (currently a separate scan must be executed for each range).

This microbenchmark is performed independently of GeoWave. It writes 100 million lexicoded numbers as keys to the underlying datastore where the values are a random set of 500 bytes. In DynamoDB and Cassandra, the default hash key load balancing is used with a single constant hash key and the lexicoded numbers are used for the range key. It executes scans with varying numbers of ranges. To be precise, the scans are repeated using powers of 10 between 1 and 100 million for the total number of ranges and either 10 million or 100 million total results are expected. The cluster size and volume of data in this microbenchmark were chosen to demonstrate the theoretical trends that should hold as compute resources are proportionally scaled with respect to data volume. The trend was consistent, and the median response times are plotted in Fig. 12 with logarithmically scaled axes.
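
For reference, lexicoding simply means encoding a number so that byte-wise key order matches numeric order; a minimal sketch for signed longs is below (the benchmark's exact encoding may differ).

```java
import java.nio.ByteBuffer;

// Sketch: flip the sign bit so that unsigned, big-endian byte ordering matches numeric ordering.
static byte[] lexicode(long value) {
    return ByteBuffer.allocate(Long.BYTES).putLong(value ^ Long.MIN_VALUE).array();
}
```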

Fig. 12. Plot of results from the multi-range scan benchmark on multiple datastores (the source code to produce these results is publicly available at https://github.com/rfecher/range-sensitivity-experiments)

The relative difference between local minima and maxima across data stores is unimportant for this purpose and can be in part explained by varying degrees of configuration to appropriately control the experiment. As one example of this, index and block caching are disabled to avoid the effects of higher variance in repeated sampling. The important conclusion that we have drawn is that as the ratio of results per range decreases on varying sizes of multi-range scans, there is clearly an inflection point in the amount of time per entry. Moreover, the observed inflection point is different for each datastore. We generalize this as a consequence of the implicit cost of each individual range: some datastores operate most efficiently on bulk multi-range scans when there are at least 10, 100, or 1000 entries per range, and are much less efficient in the extreme case where each range intersects a single entry. This is typically not as problematic in range scans on a single numeric dimension, as you would likely be able to well-represent a query with a single, or at most a few, ranges. As the number of dimensions increases, the number of ranges must also increase to represent bounding hyperrectangles, and it becomes critical to understand the performance characteristics of many ranges within each datastore.

5 Conclusions

In this work, we discussed how we leverage the GeoWave framework to provide efficient solutions to many of the implicit challenges in utilizing distributed computing frameworks with geospatial and temporal data. By using a tiered and binned SFC indexing system to index multidimensional data into single dimensional key-value stores, we are able to preserve locality in every dimension. We implement techniques like spatial subsampling to minimize the common bottleneck created at the map-rendering engine. By using an agnostic common indexing model, GeoWave allows for heterogeneous data types to be stored and analyzed in the same table. GeoWave has proven to be an incredibly powerful tool to bridge the gap between distributed computing and GIS. It has helped to lower the learning curve for these technologies and provides a fast, resource-efficient framework that is compatible with many of the most popular and best researched distributed datastores. Research is currently being done to further increase the number of compatible datastores as well as to add new analytic capabilities to the framework.