1 Introduction

Pedon data consist of field estimates, observations, and laboratory measurements. Unlike the soil map unit polygons and their associated attribute data (component data), pedon data represent point data from individual soil observations. In support of soil surveys during the last 100 years, the National Cooperative Soil Survey (NCSS) has collected a substantial amount of pedon data. Since the introduction of the National Soil Information System (NASIS) in 1994 (Fortner and Price 2012), approximately 400,000 field pedons and approximately 63,000 laboratory pedons have been digitized (Ferguson, 2015, personal communication). Although significant, this represents only a small portion of total field pedons ever described (Fig. 17.1). For digital soil mapping and updates to soil surveys, these pedon data are an invaluable resource.

Fig. 17.1 Number of pedons sampled per decade recorded in NASIS

In order to store soil data compactly and efficiently, NASIS has a hierarchical data structure (Fig. 17.2). One branch of the data structure stores point data—observations of site and pedon data, with soil horizons as the basic element. Aggregated data about soil map units and their soil components are stored in another part of the structure. Each aggregated soil component is made up of generalized soil horizons based on a sample of pedon observations. Also linked to each horizon record are additional child tables. Each of these nested child tables may include several related child tables in order to capture heterogeneous soil conditions within each soil horizon. The dominant condition is specified as the representative value (RV). For numeric component data, it is also possible to specify a range with low (L) and high (H) values. This makes it possible to characterize the distribution or variation of a particular soil variable, such as clay content. Using this database structure, it is possible to capture soil horizonation, aggregate the data, and then generate spatial predictions by linking it to the soil polygons.

Fig. 17.2 Screenshot of the NASIS database interface and the component and laboratory tables

Soil mapping involves aggregating horizon descriptions from field and laboratory pedons into component horizon data. While there are standards that guide the process of describing individual sites and pedons in the Soil Survey Manual (Soil Survey Division Staff 1993) and the Field Book for Describing and Sampling Soils (Schoeneberger et al. 2012), there are no guidelines for the process of aggregating point/pedon observations into their component database elements. The NCSS guidelines address either the development of Official Series Descriptions (OSDs) (USDA 2015) or how component ranges relate to the OSD (USDA 2013). Historically, the ranges (L, RV, H) for various soil properties have been worked out with pencil and paper or spreadsheets and then selected by expert knowledge. This practice continues today for a variety of reasons:

  1. Familiarity with existing protocol,

  2. Inconsistency among the existing data,

  3. Additional workload involved in digitizing data,

  4. Perceived or real software limitations,

  5. Lack of training in new software and statistical methods.

Prior to the advent of NASIS, there were many early attempts at estimating low, RV, and high values for soil properties (Young et al. 1991; Jansen and Arnold 1976). These earlier attempts looked at estimates for portions of the soil profile, such as surface texture or subsoil clay content, and utilized parametric estimates (i.e., mean and confidence intervals). They also demonstrated the disconnect between the limits set for taxonomic units and those observed within map unit components. This issue is now addressed by Soil Survey Technical Note 4 (USDA 2003), which allows the range (i.e., low and high) of map unit components to extend beyond those specified by the OSD.

It is possible to manipulate and summarize pedon data directly in NASIS with reports and pivot tables, but the majority of summary functions within NASIS have been designed to analyze and evaluate component-level aggregate data. Data can be exported from NASIS to other software (Table 17.1), but these tools do not provide the same concise summary of data as the reports designed for component data in NASIS. New reports can be added to NASIS, but complex reports are difficult to write because NASIS supports only a limited implementation of the Structured Query Language (SQL), which has few functions for performing statistical analysis. Here, we advocate exporting pedon data to R (R Core Development Team 2015). R supports R Markdown (Rmd) documents, which add report-writing capabilities (Xie 2014; Allaire et al. 2015), and user-contributed packages specifically designed for digital soil morphometrics, such as aqp (Beaudette et al. 2012), soilDB (Beaudette and Skovlin 2015), and soiltexture (Moeys 2015).

Table 17.1 Sample of tools for analyzing soil data sorted by user sophistication
Table 17.2 Number of the GHL versus the original horizon designations for the Miami series

2 Methods

RStudio was used to generate the R Markdown documents. RStudio is an integrated development environment (IDE) for R and provides a minimalist graphical user interface (GUI) that organizes the R environment into four task-oriented windows. Before the reports can be run in RStudio and R for the first time, the user must install several R packages and their dependencies and set up an ODBC connection to NASIS. These steps are documented online at the NRCS Soils job-aid page, and readers are pointed to these reference documents for full details. R is an extendable environment and is in constant development, so installing additional packages is a common practice as packages are updated or new packages become available.

In order to access NASIS data for use in R, a user must first load a selected set of field or laboratory pedons in NASIS. A selected set is a view or virtual table that is created via a query and serves as a working subset of a user’s local NASIS database. NASIS has numerous queries to accomplish this. Once the data are loaded in NASIS, they can be imported into R via an ODBC connection using the fetchNASIS() function in the soilDB package. The user only needs to modify the report script by entering the name of the text file (e.g., “Miami”) containing the generalized horizon label (GHL) rules that correspond to the pedons loaded in the selected set. The report script is then run, and an HTML document is generated by pressing the Knit button in RStudio. The necessary analysis steps are programmed into the report script, and the output is formatted to HTML using Rmd.

To develop a list of GHL, the user must specify which horizons are similar enough to be aggregated (Fig. 17.3). This is accomplished by mapping the existing horizon designations for each horizon and matching them to a generalized (i.e., simplified) horizonation sequence for each soil series or component. The assumption is made that the existing horizon designations accurately reflect the soil morphology and the corresponding soil properties of the horizons. For established soil series, the Official Series Description (OSD) can be used as a starting point for determining the appropriate GHL to assign to the horizons for the soil in question. The OSD provides a sample of likely horizons within either the typical pedon described or the range in characteristics (RIC) sections. For example, multiple Bt horizons might be aggregated or grouped together if it is determined that they are similar in clay content and other characteristics and that such an aggregation is not going to affect the use or interpretation of that soil. Also, Bw and Btk horizons might be aggregated if the development of the Btk horizons is incipient and does not meet the definition of an argillic or calcic diagnostic horizon. Another approach is to examine the frequency with which each horizon occurs (Fig. 17.4). Horizons that occur frequently are likely to be the most representative.
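
The frequency approach can be sketched in a few lines. The example below is a minimal Python illustration, not the R code used in the reports; the horizon designations are hypothetical and the function name `designation_frequency` is introduced here only for illustration.

```python
from collections import Counter

# Hypothetical horizon designations pooled from several pedon descriptions.
designations = ["Ap", "Bt1", "Bt2", "Ap", "E", "Bt1", "2Bt3", "2Cd", "Ap", "Bt1", "2Cd"]


def designation_frequency(hz_names):
    """Return (designation, count) pairs sorted from most to least frequent."""
    counts = Counter(hz_names)
    # Sort by descending count, then alphabetically so ties are reproducible.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


for name, n in designation_frequency(designations):
    print(f"{name:<5} {n}")
```

Designations at the top of such a list are candidates for their own GHL, while rare designations are candidates for merging into a neighboring label.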

Fig. 17.3 Hand-drawn illustration of the decision-making (e.g., question-asking) process soil scientists go through when determining the best selection of GHL for several similar soil descriptions

Fig. 17.4 Example of the original horizon designations sorted by frequency of occurrence for the Miami soil series

Once appropriate GHL have been determined for the collection of pedons, pattern matching is used to assign the new GHL to each horizon. The process uses functions designed to parse the text of each horizon designation and match it to the new GHL. By default, a pattern matches wherever it occurs in a designation, regardless of any characters before or after it. Horizon designations that do not match any of the GHL patterns are labeled “not used.” Special meta-characters anchor a pattern to the beginning (i.e., caret “^”) or end (i.e., dollar sign “$”) of the designation. For example, the GHL pattern “Bt” will match any designation that contains Bt, such as 2Bt or Bt1. To exclude 2Bt horizons, the more specific pattern “^Bt” would be necessary. Conversely, to exclude Bt1 horizons, the pattern “Bt$” would be used. If a user wishes to match a special character literally, such as the caret “^” symbol, which is also used to designate human-transported material, it is necessary to precede it with two backslashes, like so: “\\^”. As the GHL rules are developed, they are stored in a text file and later referenced by the Rmd report. If the user is satisfied with the resulting GHL designations, they can be uploaded to the comp layer ID field in the horizon table in NASIS, where they are stored for future use.
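
The effect of the anchors can be checked with a short experiment. The sketch below uses Python's `re` module purely for illustration (the reports themselves are written in R); the horizon names and the helper `matches` are hypothetical. Note that the doubled backslash described above is needed because R string literals treat the backslash as an escape character; in a Python raw string, a single backslash (`r"\^"`) expresses the same pattern.

```python
import re

# Hypothetical horizon designations, including one with a literal caret prefix.
horizons = ["Bt", "Bt1", "2Bt", "2Bt1", "^Bt"]


def matches(pattern, names):
    """Return the names in which the pattern is found anywhere in the text."""
    return [name for name in names if re.search(pattern, name)]


print("Bt  :", matches("Bt", horizons))     # unanchored: found anywhere
print("^Bt :", matches("^Bt", horizons))    # must begin with Bt
print("Bt$ :", matches("Bt$", horizons))    # must end with Bt
print("\\^  :", matches(r"\^", horizons))   # literal caret character
```

The unanchored pattern matches all five names, whereas each anchored pattern keeps only the names that begin or end with Bt.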

Example of the GHL rules for the aqp loafercreek sample data set:

  • A: ^A$|Ad|Ap

  • Bt1: Bt1$

  • Bt2: ^Bt2$

  • Bt3: ^Bt3|^Bt4|CBt$|BCt$|2Bt|2CB$|^C$

  • Cr: Cr

  • R: R
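
To show how such a rule set behaves, the rules listed above can be applied to a handful of horizon designations. This is a minimal Python sketch, not the function used in the reports. It assumes that the rules are tried in the order listed and that a later matching rule overrides an earlier one; the source does not state how conflicts are resolved, so this ordering is an assumption. The test designations are hypothetical.

```python
import re

# GHL rules copied from the list above as (label, regular expression), in order.
GHL_RULES = [
    ("A", r"^A$|Ad|Ap"),
    ("Bt1", r"Bt1$"),
    ("Bt2", r"^Bt2$"),
    ("Bt3", r"^Bt3|^Bt4|CBt$|BCt$|2Bt|2CB$|^C$"),
    ("Cr", r"Cr"),
    ("R", r"R"),
]


def assign_ghl(hz_name, rules=GHL_RULES):
    """Assign a GHL to one horizon designation.

    Assumption (not confirmed by the source): every rule is tried in order,
    and a later matching rule overrides an earlier one.
    """
    label = "not used"
    for ghl, pattern in rules:
        if re.search(pattern, hz_name):
            label = ghl
    return label


# Hypothetical horizon designations used only to exercise the rules.
for hz in ["A", "Ap", "AB", "Bt1", "2Bt1", "Bt2", "BCt", "C", "Cr", "R", "Oi"]:
    print(f"{hz:<5} -> {assign_ghl(hz)}")
```

Under this assumption, 2Bt1 matches both “Bt1$” and “2Bt” and ends up in Bt3, which shows why the order and specificity of the patterns deserve attention when the rules are written.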

Embedded in the reports are numerical and graphical summaries of the data elements typically collected and used to differentiate dissimilar soils. Numerical variables are summarized by percentiles (i.e., quantiles) instead of the mean and confidence intervals, because percentiles provide nonparametric estimates of a distribution and are less influenced by skewness, which is common for most soil properties. Percentiles also provide a neat and compact summary. The percentiles used can be adjusted by the user, but the default is the five-number summary (i.e., 0, 25, 50 % or median, 75, and 100 %) (Tables 17.3 and 17.4). Additionally, the number of observations (n) is appended to the percentiles (e.g., (0, 25, 50 % or median, 75, and 100 %)(n)) to inform the user of the sample size. The standard graphics used are box plots, which provide a similar summary and interpretation (outliers, ~5, ~25, 50 % or median, ~75, ~95 %, outliers) of the data (Fig. 17.5). To summarize categorical variables, frequency tables (i.e., contingency tables) are used, which cross-tabulate the number of occurrences of matching pairs (Tables 17.5 and 17.6).
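
The percentile summary with its appended sample size can be illustrated as follows. This is a minimal Python sketch rather than the report code; the clay values are hypothetical, and linear interpolation between ordered observations is an assumed percentile method, since the source does not specify one.

```python
def percentile(sorted_values, p):
    """Percentile for p in [0, 1], by linear interpolation (an assumed method)."""
    position = (len(sorted_values) - 1) * p
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def summarize(values, probs=(0, 0.25, 0.5, 0.75, 1)):
    """Format the percentiles followed by the number of observations, n."""
    observed = sorted(v for v in values if v is not None)  # drop missing values
    if not observed:
        return "(no data)(0)"
    stats = ", ".join(f"{percentile(observed, p):g}" for p in probs)
    return f"({stats})({len(observed)})"


# Hypothetical clay contents (%) grouped by GHL; None marks a missing estimate.
clay_by_ghl = {"A": [18, 22, None, 25, 27, 30], "Bt": [28, 35, 31]}

for ghl, clay in clay_by_ghl.items():
    print(ghl, summarize(clay))
```

Reporting n next to the percentiles makes it plain when a summary rests on only a few observations, as for the Bt group here.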

Table 17.3 Percentile summaries of field estimates of clay (%) and pH
Table 17.4 Percentile summaries of laboratory measurements of clay (%) and pH
Fig. 17.5 Box plots of field (top) and laboratory (bottom) measurements for clay (%) and pH

Table 17.5 Number of GHL versus field textures
Table 17.6 Number of GHL versus laboratory textures

3 Results and Discussion

The full field and laboratory reports are not shown here due to space limitations. The list below summarizes their content followed by sample excerpts and a discussion of the field and laboratory report content.

  • Field pedon report content:

    • General map of georeferenced pedon locations overlaid on county boundary outlines,

    • Table of identifying information: pedon id, soil series, etc.,

    • Soil profile plots (Fig. 17.6),

      Fig. 17.6 Example of soil profile plots of the field (top) and laboratory (bottom) pedons for the Miami soil series. Horizons are colored according to their GHL

    • Surface rock fragments,

    • Depths and thickness of diagnostic horizons,

    • Comparison of GHL versus original horizon designations (Table 17.2),

    • Depth and thickness distribution of GHL,

    • Numeric variables: clay content, rock fragments, pH, etc. (Table 17.3),

    • Soil texture and texture class modifier summarized by GHL (Table 17.5),

    • Soil color hue summarized by GHL,

    • Elevation, slope gradient, and slope aspect,

    • Parent material versus landform,

    • Slope shape (down slope vs. across slope shape),

    • Drainage class versus hillslope position.

  • Laboratory pedon report content:

    • General map of georeferenced laboratory pedon locations overlaid on county boundary outlines,

    • Table of identifying information: pedon id, soil series, etc.,

    • Soil profile plots (Fig. 17.6),

    • Weighted averages for the particle size control section,

    • Depths and horizon thickness for the particle size control section,

    • Comparison of GHL versus original horizon designations (Table 17.2),

    • Depth and horizon thickness of GHL,

    • Numeric variables: particle size fractions, pH, base saturation, carbon content, etc. (Table 17.4),

    • Laboratory soil texture summarized by GHL (Table 17.6).
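
As an illustration of the weighted averages listed for the particle size control section, the sketch below computes a thickness-weighted mean of clay content over a depth interval. It is written in Python for illustration only. Weighting by horizon thickness is an assumption about how the averages are formed, and the pedon, the control-section depths (20 to 70 cm), and the function name `weighted_average` are hypothetical.

```python
def weighted_average(horizons, top, bottom, prop="clay"):
    """Thickness-weighted mean of `prop` over the depth interval [top, bottom].

    Each horizon is a dict with 'top' and 'bottom' depths (cm) and a value
    for `prop`. Horizons with a missing value are skipped.
    """
    total_thickness = 0.0
    weighted_sum = 0.0
    for hz in horizons:
        value = hz.get(prop)
        if value is None:
            continue
        # Thickness of the part of this horizon that lies inside the interval.
        overlap = min(hz["bottom"], bottom) - max(hz["top"], top)
        if overlap > 0:
            total_thickness += overlap
            weighted_sum += overlap * value
    return weighted_sum / total_thickness if total_thickness else None


# Hypothetical laboratory pedon; depths in cm, clay in %.
pedon = [
    {"name": "Ap", "top": 0, "bottom": 20, "clay": 18},
    {"name": "Bt1", "top": 20, "bottom": 45, "clay": 30},
    {"name": "Bt2", "top": 45, "bottom": 80, "clay": 34},
    {"name": "C", "top": 80, "bottom": 150, "clay": 22},
]

print(weighted_average(pedon, top=20, bottom=70))
```

Only the portions of Bt1 and Bt2 that fall within the interval contribute, each in proportion to its thickness.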

Much of the information contained in the reports is used to summarize data for developing OSDs and aggregated map unit soil components. Evaluating the graphics and tables within the reports quickly shows where there are possible errors, narrow or wide ranges in values, or data gaps due to insufficient data. One of the first outputs of the report that should be examined is the contingency table of the GHL versus the original horizon designations (Table 17.2). This shows the results of the pattern matching and should be examined to confirm whether the GHL assignments aggregate the soil horizons appropriately. For example, horizons that are labeled as “not used” did not match any of the given patterns and were not included in the data summaries. The user may in some cases wish to examine these horizons further and decide whether to refine the GHL rules to include them in or exclude them from the summaries.
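
A contingency table of this kind is simply a count of (GHL, original designation) pairs. The following minimal Python sketch, using hypothetical pairs and illustrative function names, shows the idea; it is not the code used in the reports.

```python
from collections import Counter


def cross_tabulate(pairs):
    """Count occurrences of (row, column) pairs and return a nested dict table."""
    counts = Counter(pairs)
    rows = sorted({row for row, _ in counts})
    cols = sorted({col for _, col in counts})
    return {row: {col: counts.get((row, col), 0) for col in cols} for row in rows}


def print_table(table):
    """Print the table with row labels on the left and column labels on top."""
    cols = list(next(iter(table.values())))
    print(f"{'':<10}" + "".join(f"{col:>6}" for col in cols))
    for row, cells in table.items():
        print(f"{row:<10}" + "".join(f"{cells[col]:>6}" for col in cols))


# Hypothetical (GHL, original designation) pairs after pattern matching.
pairs = [
    ("A", "Ap"), ("A", "A"), ("Bt", "Bt1"),
    ("Bt", "Bt2"), ("Bt", "Bt1"), ("not used", "Oi"),
]

table = cross_tabulate(pairs)
print_table(table)
```

Non-zero counts in the “not used” row point directly to the designations that the current rules fail to capture.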

As an example, the following tables and figures show excerpts from all the field and laboratory data labeled as the Miami soil series within NASIS (Tables 17.3, 17.4, 17.5, and 17.6) (Figs. 17.2, 17.4, 17.5, and 17.6). The example shows that the field estimates of clay content are missing for A horizons. Given the age of the data set, which ranges from 1951 to 2014, this is not surprising, as it has not always been common practice to record field estimates for clay content. The laboratory data by comparison have numerous measurements of clay content. By examining the box plots, we can see a clay increase in the Bt and 2Bt horizons and a decrease in the 2Cd horizon. The box plots for pH show a wide interquartile range and a slight decrease in the median pH with depth. The subsoil (i.e., 2BCt and 2Cd) shows a much narrower interquartile range and a higher median pH. Examining the contingency tables of GHL versus texture, we can see a greater frequency of silty textures in the A and E horizons (Tables 17.5 and 17.6). The Bt horizon has a higher frequency of clay loam textures. If silty textures are indicative of the loess cap associated with the Miami soil series, numerous Bt horizons should be relabeled as 2Bt horizons. The report’s summaries allow soil scientists to examine their data quickly, particularly when the data are viewed in aggregate.

4 Conclusion

Here, we have presented an effort to efficiently analyze the large volume of soil horizon data present in the NASIS database. We have developed R Markdown reports that provide univariate summaries of the data elements typically used to develop OSDs and soil map unit components. Using the relational structure of the NASIS database combined with the extensible data handling and statistical analysis capabilities of R, it is possible to generate powerful graphical and tabular summaries for collections of pedon data bundled into one report. Summarizing pedon data by horizon is a critical and time-consuming step in the soil survey workflow. Because we can typically only investigate soil variability by examining several soil profiles and comparing multiple descriptions, viewing the data in aggregate allows us to approximate the representative values and ranges for soil horizons (i.e., polypedons), which are the building blocks of soil map unit components.