
1 Introduction

Every year natural hazards cause extensive damage to people, property, and the environment. Drastic population growth, especially along coastal areas and in developing countries, has raised the risk that natural hazards pose to large, vulnerable populations to unprecedented levels (Tate and Frazier 2013). Furthermore, unusually strong and frequent weather events are occurring worldwide, causing floods, landslides, and droughts that affect thousands of people (Smith and Katz 2013). A single catastrophic event can claim thousands of lives, cause billions of dollars of damage, trigger a global economic depression, destroy natural landmarks, render a large territory uninhabitable, and destabilize the military and political balance in a region (Keilis-Borok 2002). The increasing urbanization of human society, including the emergence of megacities, has also led to highly interdependent and vulnerable social infrastructure that may lack the resilience of a more agrarian, traditional society (Wenzel et al. 2007). In urban areas, it is therefore crucial to develop new ways of assessing damage in real time to help mitigate the risks posed by hazards. Annually, the identification, assessment, and repair of damage caused by hazards requires thousands of work hours and billions of dollars.

Remote sensing data are of paramount importance during disasters and have become the de facto standard for providing high resolution imagery for damage assessment and the coordination of disaster relief operations (Cutter 2003; Joyce et al. 2009). First responders rely heavily on remotely sensed imagery to coordinate relief and response efforts and to prioritize the allocation of resources.

Determining the location and severity of damage to transportation infrastructure is particularly critical for establishing evacuation and supply routes as well as repair and maintenance schedules. Following the Colorado floods of September 2013, over 1000 bridges required inspection, and approximately 200 miles of highway and 50 bridges were destroyed. A variety of assessment techniques were utilized following Hurricane Katrina in 2005 to evaluate transportation infrastructure, including visual, non-destructive, and remote sensing methods. However, the assessment of transportation infrastructure over such a large area could have been accelerated through the use of high resolution imagery and geospatial analysis (Uddin 2011; Schnebele et al. 2015).

Despite the wide availability of large remote sensing datasets from numerous sensors, specific data might not be collected at the time and place most urgently required. Geo-temporal gaps result from satellite revisit time limitations, atmospheric opacity, or other obstructions. Aerial platforms, especially Unmanned Aerial Vehicles (UAVs), can, however, be deployed quickly to collect data over specific regions and used to complement satellite imagery. UAVs are capable of providing high resolution, near real-time imagery, often at less expense than manned aerial or space-borne platforms. Their quick response times, high maneuverability, and high resolution make them important tools for disaster assessment (Tatham 2009).

Contributed data that contain spatial and temporal information can provide valuable Volunteered Geographic Information (VGI), harnessing the power of ‘citizens as sensors’ to provide a multitude of on-the-ground data, often in real time (Goodchild 2007). Although these volunteered data are often published without scientific intent and usually carry little scientific merit, it is still possible to mine them for mission-critical information. For example, during Hurricane Katrina, geolocated pictures and videos searchable through Google provided early emergency response with ground-view information. Such data have been used during major events to capture, in near real time, the evolution and impact of major hazards (De Longueville et al. 2009; Pultar et al. 2009; Heverin and Zach 2010; Vieweg et al. 2010; Acar and Muraki 2011; Verma et al. 2011; Earle et al. 2012; Tyshchuk et al. 2012).

Volunteered data can be employed to provide timely damage assessment, to assist rescue and relief operations, and to optimize engineering reconnaissance (Laituri and Kodrich 2008; Dashti et al. 2014; Schnebele and Cervone 2013; Schnebele et al. 2014a, b). While the quantity and real-time availability of VGI make it a valuable resource for disaster management applications, its volume and its unstructured, heterogeneous nature make its effective use challenging. Volunteered data can be diverse, complex, and overwhelming in volume, velocity, and the variety of viewpoints they offer, and negotiating these streams is beyond the capacity of human analysts. Current research offers novel capabilities to utilize these streams in new ways, leveraging, fusing, and filtering this new generation of air-, space-, and ground-based sensor-generated data (Oxendine et al. 2014).

2 Data

Multiple sources of contributed, remote sensing, and open source geospatial data were collected and utilized in this research. All data relate to the 2013 Colorado floods and were collected between September 11 and 17, 2013. Most of the data were collected in real time, as the event was unfolding. A summary of the sources and collection dates of the contributed and remote sensing data is provided in Table 1.

Table 1 Sources and collection dates of contributed and remote sensing data

2.1 Contributed Data

2.1.1 Twitter

The social networking site Twitter is used by the public to share information about their daily lives through micro-blogging. These micro-blogs, or ‘tweets’, are limited to 140 characters, so abbreviations and colloquial phrasing are common, making automated filtering by content challenging. Different criteria are often applied for filtered and directed searches of Twitter content. For example, a hashtag is an identifier unique to Twitter and is frequently used as a search tool. Any user can create a hashtag, which may develop a greater public following if it is viewed as useful, popular, or a source of current information. Other search techniques may filter by keywords or location.

Twitter data were collected using a geospatial database set up at The Pennsylvania State University (PSU) GeoVISTA center. Although the volume of the Twitter data stream is huge, only a small percentage of tweets contain geolocation information. For this study, about 2000 geolocated tweets with the hashtag #boulderflood were collected.

2.1.2 Photos

In addition, a total of 80 images relating to the period 11–14 September 2013 were downloaded from the website of the city of Boulder (https://bouldercolorado.gov/flood). These images did not contain geolocation information, but they included a description of when and where they were acquired. The Google geocoding API was used to convert each spatial description to longitude and latitude coordinates.
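As a rough illustration of this geocoding step, the sketch below queries the Google Geocoding API for a free-text location description from R. The helper function, the example address, and the API key placeholder are assumptions for illustration only; the chapter does not specify which client or service endpoint was actually used.

```r
# Minimal sketch (R): geocoding a free-text location description with the
# Google Geocoding API. Address and API key are illustrative placeholders.
library(httr)
library(jsonlite)

geocode_description <- function(description, api_key) {
  resp <- GET("https://maps.googleapis.com/maps/api/geocode/json",
              query = list(address = description, key = api_key))
  parsed <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
  loc <- parsed$results$geometry$location[1, ]   # first candidate match
  c(lon = loc$lng, lat = loc$lat)
}

# Example (hypothetical photo description):
# geocode_description("Broadway and Canyon Blvd, Boulder, CO", api_key = "YOUR_KEY")
```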

While in the current research the data were semi-manually georectified, it is possible to use services such as Flickr or Instagram to automatically download geolocated photos. These services can provide additional crucial information during emergencies because photos are easily verifiable and can contain valuable information for transportation assessment.

2.2 Remote Sensing

2.2.1 Satellite

Full-resolution GeoTIFF multispectral Landsat 8 OLI/TIRS images collected on May 12, 2013 and September 17, 2013 provided data for the Boulder County area before and after the flooding, respectively. The data were downloaded from the USGS Hazards Data Distribution System (HDDS). Eight of the nine Landsat 8 OLI spectral bands have a resolution of 30 m: Band 1 (coastal aerosol, useful for coastal and aerosol studies, 0.43–0.45 μm); Bands 2–4 (visible blue, green, and red, 0.45–0.51, 0.53–0.59, 0.64–0.67 μm); Band 5 (near-IR, 0.85–0.88 μm); Bands 6 and 7 (shortwave-IR, 1.57–1.65, 2.11–2.29 μm); and Band 9 (cirrus, useful for cirrus cloud detection, 1.36–1.38 μm). In addition, a 15 m panchromatic band (Band 8, 0.50–0.68 μm) and two 100 m thermal-IR bands (Bands 10 and 11, 10.60–11.19, 11.50–12.51 μm) were also collected from Landsat 8 OLI/TIRS.

2.2.2 Aerial

Aerial photos collected by the Civil Air Patrol (CAP), the civilian auxiliary of the US Air Force, from 14–17 September 2013 in the areas surrounding Boulder (105.5364°–104.9925° W longitude and 40.26031°–39.93602° N latitude) provided an additional source of remotely sensed data. The georeferenced CAP RGB composite photos were downloaded from the USGS Hazards Data Distribution System (HDDS).

2.3 Open Source Geospatial Data

Shapefiles defining the extents of the City of Boulder and Boulder County were downloaded from the City of Boulder and the Colorado Department of Local Affairs websites, respectively. In addition, a TIGER/Line shapefile of the road network for Boulder County was downloaded from the US Census Bureau.

3 Methods

The proposed methodology is based on the fusion of contributed data with remote sensing imagery for damage assessment of transportation infrastructure.

3.1 Classification of Satellite Images

For the Colorado floods of 2013, a supervised machine learning classification was employed to identify water in each of the satellite images. Water pixels were identified in both Landsat images using a decision tree induction classifier. Ripley (2008) describes the general rule induction methodology and its implementation in the R statistical package used in this study. In the near-IR, water is easily distinguished from soil and vegetation because of its strong absorption (Smith 1997). Therefore, imagery from Landsat’s Band 5 (near-IR, 0.85–0.88 μm) was used for the machine learning classification. Control areas of roughly the same size were identified over water pixels as examples of the class ‘water’ and over regions with no water pixels as counter-examples of the class ‘other’. The Landsat data for these regions were used as training events by the decision tree classifier, and the learned tree was then used to classify the remaining pixels in the scene. This process was repeated for both the May and September images.
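The chapter cites Ripley’s rule-induction implementation in R but does not name the specific package; the sketch below uses rpart only as an illustrative stand-in, with made-up Band 5 reflectance values for the ‘water’ and ‘other’ training regions.

```r
# Minimal sketch (R): decision-tree classification of 'water' vs. 'other'
# pixels from Landsat Band 5 values. Training values are placeholders.
library(rpart)

train <- data.frame(
  band5 = c(0.02, 0.03, 0.04, 0.05, 0.25, 0.28, 0.32, 0.35),
  class = factor(c(rep("water", 4), rep("other", 4)))
)

# Induce the tree; permissive control settings because the toy set is tiny
fit <- rpart(class ~ band5, data = train, method = "class",
             control = rpart.control(minsplit = 2, cp = 0))

# Apply the learned tree to the remaining pixels of a scene
scene <- data.frame(band5 = c(0.01, 0.29, 0.06, 0.33))
predict(fit, newdata = scene, type = "class")
```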

3.2 Spatial Interpolation

Satellite remote sensing data may be insufficient as a function of revisit time or obstructions due to clouds or vegetation. Therefore, data from other sources can be used to provide supplemental information. Aerial remote sensing data as well as contributed data, or VGI, from photos and tweets were used to capture or infer the presence of flooding in a particular area.

By utilizing different types and sources of data, this research aims to extract as much information as possible about the damage caused by natural hazards. Environmental data are often based on samples over limited areas, and the tweets analyzed represent only approximately 1 % of the total tweets generated during the time period. This situation is often referred to as the ‘Big Data paradox’: the very large amounts of data analyzed are only a small sample of the total data and might not reflect the distribution of the entire population.

In addition, the absence of data in some parts of the region is likely to lead to an underestimate of the total damage. To compensate for the missing data, the spatio-temporal distribution of the data was analyzed by weighting according to the spatial relationships of the points (Tobler 1970, p. 236). This assumes some level of dependence among spatial data (Waters 2017). For these reasons a point-based representation of the data may not be sufficient to provide a complete portrayal of the hazard; therefore, a spatial interpolation is employed to estimate flood conditions and damage from point sources.

Spatial interpolation consists of estimating the damage at unsampled locations by using information from the nearest available measured points. For this purpose, interpolation generates a surface passing through the sampled points. This process can be implemented using two different approaches: deterministic models and statistical techniques. Although both use a mathematical function to predict unknown values, the former does not provide an indication of the extent of possible errors, whereas the latter supplies probabilistic estimates. Deterministic models include Inverse Distance Weighting (IDW), rectangular, Natural Neighbor (NN), and spline interpolation. Statistical methods include Kriging (ordinary, simple, and universal) and kernel methods. In this project, kernel interpolation was used.

Kernel interpolation is based on the most popular non-parametric density estimator, a function \( \widehat{p}:\Re \times \Re^{N}\to \Re \) with the following form:

$$ \widehat{p}(x)=\frac{1}{Nh}\sum_{i=1}^{N}K\left(\frac{x-x_i}{h}\right). $$
(1)

where K(u) is the kernel function and h is the bandwidth (Raykar and Duraiswami 2006). There are different kinds of kernel density estimators, such as the Epanechnikov, triangular, Gaussian, and rectangular kernels. The density estimator chosen for this work is a Gaussian kernel with zero mean and unit variance, which has the following form:

$$ \widehat{p}(x)=\frac{1}{N\sqrt{2\pi h^2}}\sum_{i=1}^{N}e^{-\left(x-x_i\right)^2/2h^2}. $$
(2)

Kernel interpolation is often preferred because it provides an estimate of the error, in contrast to methods based on radial basis functions. In addition, it is more effective than Kriging interpolation for small data sets (for example, the photo data set in this project) or data with non-stationary behavior (all data sets used in this work) (Mühlenstädt and Kuhnt 2011).
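For concreteness, the sketch below implements the Gaussian estimator of Eq. (2) directly; the observations, bandwidth, and evaluation grid are synthetic and serve only to illustrate the formula.

```r
# Minimal sketch (R): direct implementation of the Gaussian kernel estimator
# in Eq. (2); data and bandwidth are synthetic.
gauss_kde <- function(x, data, h) {
  sapply(x, function(xi)
    sum(exp(-(xi - data)^2 / (2 * h^2))) / (length(data) * sqrt(2 * pi * h^2)))
}

obs  <- c(1.2, 1.9, 2.1, 3.4, 3.6)
grid <- seq(0, 5, by = 0.1)
plot(grid, gauss_kde(grid, obs, h = 0.4), type = "l",
     xlab = "x", ylab = "density estimate")
```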

In general, spatial interpolation is introduced to solve the following problems associated with histograms:

  • The wider the interval, the greater the information loss;

  • Histograms provide estimates of local density (points are “local” to each other if they belong to the same bin), so this method does not account for the proximity of points;

  • The resulting estimate is not smooth.

These problems can be avoided by using a smooth kernel function, rather than a histogram “block”, centered over each point and summing the functions at each location on the scale. However, it is important to note that the results of kernel density interpolation depend strongly on the size of the defined interval, or “bandwidth”. The bandwidth is a smoothing parameter that determines the width of the kernel function.

The result of a kernel density estimation will depend on the kernel K(u) and the bandwidth h chosen. The former is linked to the shape, or function, of the curve centered over each point whereas the latter determines the width of the function. The choice of bandwidth will exert greater influence over an interpolation result than the kernel function. Indeed, as the value of h decreases, the local weight of single observations will increase.
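The sketch below illustrates this effect on synthetic data: with a small h the estimate is spiky and dominated by individual observations, while a larger h yields a smoother, more diffuse curve.

```r
# Minimal sketch (R): effect of the bandwidth h on a 1-D Gaussian kernel
# density estimate; the sample is synthetic.
set.seed(1)
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 4))

d_narrow <- density(x, bw = 0.1, kernel = "gaussian")  # small h: local, spiky
d_wide   <- density(x, bw = 1.5, kernel = "gaussian")  # large h: smooth

plot(d_wide, col = "blue", main = "Effect of bandwidth (illustrative)")
lines(d_narrow, col = "red")
legend("topright", legend = c("h = 1.5", "h = 0.1"),
       col = c("blue", "red"), lty = 1)
```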

Because confidence in data may vary with source characteristics, the bandwidth can be varied for each data type. Generally, as certainty, or confidence, in a given data source increases, so does the chosen bandwidth. For example, aerial images can be verified visually and are therefore considered more credible information sources. By contrast, some tweets may not be useful because they are subjective; some contain only users’ feelings rather than information about the damage caused by the hazard. For this reason, a smaller bandwidth was chosen for tweets in this work.

There are different methods for choosing an appropriate bandwidth. For example, it can be identified as the value that minimizes an approximation of the error between the estimate \( \widehat{p}(x) \) and the actual density p(x), as explained in Raykar and Duraiswami (2006). In this work, a spatial kernel estimate was used because all the data sets considered consist of points on the Earth’s surface. For d-dimensional points the kernel estimate takes the form:

$$ \widehat{p}\left(x;H\right)=N^{-1}\sum_{i=1}^{N}K_H\left(x-X_i\right) $$
(3)

where:

$$ \begin{array}{l} x=(x_1,x_2,\dots,x_d)^T \\ X_i=(X_{i1},X_{i2},\dots,X_{id})^T,\quad i=1,2,\dots,N \\ K_H(x)=|H|^{-1/2}\,K(H^{-1/2}x) \end{array} $$

In this case K(x) is the spatial kernel and H is the bandwidth matrix, which is symmetric and positive-definite. As in the one-dimensional case, an optimal bandwidth matrix has to be chosen, for example using the method illustrated in Duong and Hazelton (2005). In this project, data were interpolated using the R command smoothScatter of the graphics package, which is based on the Fast Fourier Transform. It is a variation of regular kernel interpolation that reduces the computational complexity from O(N²) to O(N). The O (‘big O’) notation is a computer science metric used to quantify and compare the complexity of algorithms (Knuth 1976): O(N) indicates linear complexity, whereas O(N²) indicates a much higher, quadratic complexity. Generally, the bandwidth is calculated automatically using the R command bkde2D of the KernSmooth package. However, the bandwidth for the tweet interpolation was specified manually because the information in tweets carries a lower weight.
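A minimal sketch of this interpolation step is given below, using KernSmooth::bkde2D on a handful of made-up point coordinates. The bandwidth values and point locations are assumptions chosen only to show how a smaller bandwidth can be assigned to a less trusted source such as tweets; they are not the values used in the study.

```r
# Minimal sketch (R): FFT-based 2-D kernel density surface for geolocated
# points (e.g., tweets). Coordinates and bandwidths are illustrative.
library(KernSmooth)

pts <- cbind(lon = c(-105.28, -105.27, -105.26, -105.25),
             lat = c(  40.01,   40.02,   40.01,   40.00))

bw_tweets <- c(0.005, 0.005)   # smaller bandwidth: lower confidence in tweets

dens <- bkde2D(pts, bandwidth = bw_tweets, gridsize = c(201, 201))

# dens$x1 and dens$x2 hold the grid coordinates, dens$fhat the density surface
image(dens$x1, dens$x2, dens$fhat,
      xlab = "Longitude", ylab = "Latitude")

# graphics::smoothScatter() produces a comparable smoothed plot directly
smoothScatter(pts[, 1], pts[, 2], bandwidth = bw_tweets, nbin = 201)
```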

4 Analysis and Results

4.1 Damage Assessment

Using the supervised machine learning classification discussed in Sect. 3.1, water pixels were identified in the before and after Landsat images. A comparison of the classifications in the Landsat 12 May 2013 ‘before’ image (Fig. 1a) and the Landsat 17 September 2013 ‘after’ image (Fig. 1b) shows additional water pixels classified in the ‘after’ image that are associated with the flood event. One of the challenges of classifying remote sensing imagery is illustrated in Fig. 1b, where clouds over the Front Range in Colorado are misclassified as water pixels. In addition, because the Landsat ‘after’ image was collected on 17 September, seven days after the beginning of the flood event, it is likely that the maximum flood extent is not captured in this scene.
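A minimal sketch of this before/after comparison is shown below, using placeholder logical masks; pixels classified as water only in the ‘after’ scene are attributed to the flood.

```r
# Minimal sketch (R): flood-related pixels as the difference between the
# 'after' and 'before' water masks; the matrices are placeholders.
water_before <- matrix(c(TRUE, FALSE, FALSE, FALSE), nrow = 2)
water_after  <- matrix(c(TRUE, TRUE,  FALSE, TRUE),  nrow = 2)

flood_pixels <- water_after & !water_before
sum(flood_pixels)   # count of newly flooded pixels
```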

Fig. 1

Water pixel classification using Landsat 8 data collected 12 May 2013 (a) and 17 September 2013 (b). The background images are Landsat Band 5 for each of the two days

Following the collection of remote sensing data, contributed data were also collected, geolocated, and interpolated following the methods discussed in Sect. 3.2. The interpolated pixels were then overlaid on the remote sensing classification to give an enhanced indication of flood activity in the Boulder area (Fig. 2). The use of supplemental data sources, such as the Civil Air Patrol (CAP) aerial photos, reveals flooding in areas that were not captured by the satellite remote sensing. In the case of the Boulder flooding, cloud cover over the Front Range and the western parts of the City of Boulder made the identification of water from satellite platforms difficult. The ability of planes to fly below cloud cover, and to collect data without the revisit limitations common to space-borne platforms, allowed the CAP to capture flooding and damage in the western parts of Boulder County that are not visible in the satellite images (Fig. 2).

Fig. 2

Classified Landsat 8 image collected 17 September 2013 and interpolated ancillary data give an indication of flood activity around the Boulder, CO area. While some data sources overlap, others have a very different spatial extent

4.2 Transportation Classification

Although the identification of water in the near-IR is a standard technique, the supervised classification of the Landsat data did not label any pixels as ‘water’ in the City of Boulder (Fig. 1b). This may be because the image was collected on 17 September, a week after the flood event began, by which time flood waters could have receded, or because of obstructing vegetation and cloud cover. It is interesting to note, however, that contributed data such as photos and tweets do indicate the presence of flooding in the City of Boulder. Using the geolocated tweets (n = 130) containing the hashtag ‘#boulderflood’ located near Boulder (105.0814°–105.2972° W longitude and 40.09947°–39.95343° N latitude), as well as the geolocated photos (n = 80), flood activity is indicated by local citizens in the downtown area. While there may be uncertainties associated with information obtained from tweets, the presence of flooding and damage is more easily verified in photos.
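A simple sketch of this spatial filtering step is shown below, using the bounding box quoted above and a placeholder data frame of geolocated tweets.

```r
# Minimal sketch (R): selecting contributed points inside the downtown Boulder
# bounding box; 'tweets' is an illustrative placeholder data frame.
tweets <- data.frame(lon = c(-105.12, -105.40, -105.25),
                     lat = c(  40.00,   40.30,   39.98))

in_box <- tweets$lon >= -105.2972 & tweets$lon <= -105.0814 &
          tweets$lat >=  39.95343 & tweets$lat <=  40.09947

boulder_tweets <- tweets[in_box, ]
```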

Using the contributed data points (Fig. 3a), a flooding and damage surface is interpolated for the downtown Boulder area using kernel density smoothing as discussed in Sect. 3.2. After an interpolated surface is created from each data set (tweets and photos), the surfaces are combined using a weighted sum overlay approach. The tweets layer is assigned a weight of 1 and the photos layer a weight of 2. A higher weight is assigned to the photos layer because information can be more easily verified in photos, and there is therefore a higher level of confidence in this data set. The weighted layers are summed, yielding a flooding and damage assessment surface created solely from contributed data (Fig. 3b). This surface is then paired with a high resolution road network layer, and roads are identified as potentially compromised or impassable based on the underlying damage assessment (Fig. 3c). In a final step, the classified roads are compared to roads closed by the Boulder Emergency Operations Center (EOC) from 11–15 September 2013 (Fig. 3d).
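A minimal sketch of the weighted-sum overlay follows. The two matrices stand in for the interpolated tweet and photo surfaces on a common grid, the weights follow the text (tweets = 1, photos = 2), and the threshold used to flag potentially compromised cells is an illustrative assumption, not a value from the chapter.

```r
# Minimal sketch (R): weighted-sum overlay of the tweet and photo surfaces.
# Placeholder grids; weights follow the text, the threshold is an assumption.
set.seed(2)
dens_tweets <- matrix(runif(25), nrow = 5)
dens_photos <- matrix(runif(25), nrow = 5)

damage <- 1 * dens_tweets + 2 * dens_photos
damage <- damage / max(damage)          # normalize to [0, 1]

compromised_cells <- damage > 0.5       # cells to intersect with the road layer
```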

Fig. 3

Using contributed data geolocated in the downtown Boulder area (a), an interpolated damage surface is created (b), which, when paired with a road network, classifies potentially compromised roads (c). Roads closed by the Boulder Emergency Operations Center (EOC) that were also classified using this approach are shown in (d)

5 Conclusions

Big data, such as those generated through social media platforms, provide unprecedented access to real-time, on-the-ground information during emergencies, supplementing traditional, standard data sources such as space- and aerial-based remote sensing. In addition to inherent gaps in remote sensing data due to platform or revisit limitations or atmospheric interference, obtaining data and information for urban areas can be especially challenging because of resolution requirements. Identifying potential damage at the street or block level provides an opportunity for triaging site visits for evaluation. However, utilizing big data efficiently and effectively is challenging owing to its complexity, size, and in the case of social media, heterogeneity (variety). New algorithms and techniques are required to harness the power of contributed data in real-time for emergency applications.

This paper presents a new methodology for locating natural hazards using contributed data, in particular Twitter. Once remote sensing data are collected, they can be used in combination with contributed data to provide an assessment of the ensuing damage. While Twitter is effective at identifying ‘hot spots’ at the city level, other sources, such as photos, provide supplemental information at finer, street-level detail. In addition, remote sensing data may be limited by revisit times or cloud cover, so contributed ground data provide an additional source of information.

Challenges associated with contributed data, such as producer anonymity, geolocation accuracy, and differing levels of confidence in the data, make the application of these data during hazard events especially demanding. In addition to identifying a particular hazard, in this case flood waters, pairing the interpolated damage assessment surface with a road network creates a classified ‘road hazards map’ that can be used to triage and optimize site inspections or to task additional data collection.