Introduction

Globally, a significant body of research has focussed on contrasting land-uses and management with a view to quantifying differences in SOC stocks in different agricultural systems (e.g. West and Post 2002; Dawson and Smith 2007). In Australia, the Soil Carbon Research Program (and subsequent National Soil Carbon Program) commenced in 2009 with this aim and has yielded numerous studies attempting to quantify such effects (Baldock et al. 2013; Sanderman et al. 2013; Wilson and Lonergan 2013; Rabbi et al. 2014). Whilst quantifying the effects of land-use and land-management on SOC stocks is an important undertaking for carbon accounting purposes (Richards 2001), it assumes that the changes in SOC associated with alteration of land-use and management are easily quantified. Unfortunately, detecting changes in SOC between land-use systems can be difficult within the timeframes necessary for C accounting.

This difficulty in successfully distinguishing SOC differences under varying land-uses may be (partially) attributable to the numerous, complex and at times conflicting factors influencing SOC dynamics. These influential factors were described by Jenny (1941) and include time, parent material, topography, climate and organisms (including humans). Although Jenny characterised these factors as being key to soil formation, they are all conceptually linked with SOC dynamics as SOC is intimately linked with soil development.

Human influence on SOC is therefore but one of many driving factors, so that the influence of land-use and management on SOC storage may be masked by the influence of other over-arching factors such as climate. For example, Jackson et al. (2002) reported overall gains in SOC after encroachment of grasslands by woody species in drier climates but losses in wetter climates. However, a greater understanding of the driving factors of SOC dynamics may enable de-trending SOC data from the various influential signals, thereby allowing detection of differences between land-uses and land management schemes previously thought to be undetectable. To this end it was recently noted that to improve future IPCC assessments ‘it is critical to understand the drivers of soil carbon dynamics in the models, as well as the real world’ (Todd-Brown and Luo 2014).

Globally, SOC storage has been linked with vegetation, climate and physico-chemical soil properties such as texture or soil type (Schlesinger 1977; Post et al. 1982; Burke et al. 1989; Jobbagy and Jackson 2000). Climate also appears most relevant to SOC storage at regional scales (Wiesmeier et al. 2013b). Topography, on the other hand, appears to become relevant at the sub-regional scale (Grimm et al. 2008; Davy and Koen 2013). In their landmark paper, Jobbagy and Jackson (2000) assessed not only SOC storage but also its vertical distribution, and found that the factors influencing total SOC content differed from those influencing SOC vertical distribution. Specifically, climate was most influential in shallow depths (0–20 cm), whereas soil texture (sand and clay content) became more important at greater depths (>20 cm). More recently, similar results were reported by Badgery et al. (2013), who found that the influence of climate is most pronounced near the surface (0–10 cm), but that at 20–30 cm soil texture and mineralogy become more influential to SOC storage. This suggests that an in-depth understanding of the drivers of SOC dynamics requires assessment at different depths.

Changes in land-use lead to alteration of SOC content near the soil surface (0–10 cm) (Young et al. 2005; Luo et al. 2010; Wilson et al. 2010, 2011). Theoretically, changes in land-use should also affect SOC stocks below this depth, as land-use change usually implies vegetation change (e.g. reforestation or clearing of land for agricultural use), and vegetation type is strongly linked to SOC storage and its vertical distribution (Jackson et al. 2000). However, despite international evidence of subsoil response to changing land-use and management (Wiesmeier et al. 2013a), response of SOC below the surface due to land-use changes is yet to be clearly demonstrated in Australian soils.

Similarly, changes in land management are generally reported to have the greatest influence on the SOC content nearest the soil surface (0–15 cm) (Cotching 2012; Sanderman et al. 2013; Wilson and Lonergan 2013), with effects being more difficult to detect deeper in the soil (Badgery et al. 2013). This has led to the suggestion that subsurface SOC stocks may better reflect historical rather than current land-management (Wilson and Lonergan 2013). Despite this, altered land-management practices have been reported to influence subsurface carbon stocks (Meersmans et al. 2009) and the stratification of SOC (Yang et al. 2008) on decadal timeframes internationally, and more research is needed to elucidate the effects of both land-management and land-use on SOC stocks at different depths.

A possible reason for the discrepancies between international and Australian findings regarding the effects of land-use and management on SOC storage is the unique nature of Australian soils. These are characterised by low SOC storage compared with soils internationally, which has been attributed to Australian climatic influences (low precipitation, high temperature) leading to low inputs of organic matter into the soil (Hassink 1997). Given the above-mentioned difficulties in detecting anthropogenic-driven SOC changes below the surface 10 cm in Australian soils, it is likely that, in contrast to many international soils, the active soil zone typically referred to as ‘topsoil’ may only comprise the surface 10 cm in Australian soils.

We hypothesise that the drivers of SOC storage vary with depth, namely that the factors influencing the total amount of SOC stored by a soil vary with depth and also differ from those influencing the vertical distribution of SOC. To test this hypothesis we used two analysis approaches—machine learning and classical parametric statistics—to assess the importance of various possible explanatory factors on SOC storage and vertical distribution to a depth of 30 cm. The explanatory factors were derived from site observations and GIS data. The SOC dataset comprised over 1400 sites from New South Wales (NSW), Australia, drawn from existing data relating to soils derived from the NSW Soil Monitoring Program and the National Soil Carbon Research Program.

Methodology

Study area

New South Wales is a state in eastern Australia, covering an area of 801 600 km2. Although lying within the global temperate zone, NSW encompasses a wide range of climates, with mean annual precipitation (MAP) ranging from <200 mm year−1 in the west to >1500 mm year−1 in the north-east of the state (Fig. 1). Rainfall generally increases from the west to east. The temperatures are mild overall but can be very hot in the north-west (>45 °C) and very cold in the alpine regions of the south-east (<0 °C). This variability in climate is reflected in the biodiversity of the state, with over 17 bioregions represented within the State, ranging from sandy deserts to lush rainforests (Sahukar et al. 2003). The State is traversed from north to south by the Great Dividing Range, which has a maximum elevation of >2200 m, and is a geographical boundary between the eastern seaboard and western centre.

Fig. 1 The study region and location of sampling sites

The soils of NSW are highly diverse, with all 14 orders of the Australian Soil Classification (Isbell 2002) represented, such as siliceous sands (Tenosols, World Reference Base (WRB) Arenosol), cracking clays (Vertosols, WRB Vertisol) and water-influenced gleys (Hydrosols, WRB Gleysol) (Charman and Murphy 2007). The diversity of the soil landscape reflects the variation in climate, geology, topography and vegetation found within the State. This diversity in soil type and climate enables a range of agricultural land-uses, with the vast majority (74 %) of land used for grazing and cropping, with other uses being forestry and nature conservation (Australian Bureau of Statistics 2013). Overall the state of the soils of NSW is considered fair, but a noticeable, moderate decline in soil health is observable in the State (State of NSW Environmental Protection Authority 2012).

Soil and site data

The data presented here were merged from two separate datasets. The first was derived from the NSW soil condition Monitoring, Evaluation and Reporting (MER) Program. This dataset contains information on 800 sites and provides a provisional ‘baseline’ data layer gathered in 2008–09. The second dataset was from the Soil Carbon Research Program (SCRP), initiated across Australia in 2009 with the aim of assessing current carbon stocks across a range of representative agricultural land-use systems, focussing on those land-uses that are believed to hold the most promise for carbon storage. The work reported here was a part of the NSW component of the SCRP program, comprising a total of 741 sites.

Both datasets contained information regarding SOC concentration (%SOC), bulk density (ρ, in g cm−3) and gravimetric rock content (RM, dimensionless units), from which the C stocks (t ha−1) in a given sampling depth (cm) were calculated according to:

$$ {C}_{Stock} = {\%}_{SOC}\times \rho \times \left(1-RM\right)\times depth $$
(1)
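As an illustration of Eq. (1), the calculation can be sketched in a few lines of Python. The function and argument names are hypothetical and not part of the original analysis, and the input values are invented. The sketch assumes %SOC is given in percent, ρ in g cm−3, RM as a fraction between 0 and 1 and depth in cm, in which case the product is numerically in t ha−1 because the factors of 100 cancel.

```python
def carbon_stock(soc_percent, bulk_density, rock_fraction, depth_cm):
    """C stock in t/ha following Eq. (1).

    soc_percent   : SOC concentration in percent (e.g. 2.0 for 2 %)
    bulk_density  : bulk density in g cm^-3
    rock_fraction : gravimetric rock content as a fraction (0 to 1)
    depth_cm      : thickness of the sampled layer in cm
    """
    if not 0.0 <= rock_fraction <= 1.0:
        raise ValueError("rock_fraction must lie between 0 and 1")
    return soc_percent * bulk_density * (1.0 - rock_fraction) * depth_cm


# Hypothetical layer: 2 % SOC, 1.3 g cm^-3, 10 % rock, 10 cm thick.
example_stock = carbon_stock(2.0, 1.3, 0.1, 10.0)
```

For the invented layer above, the stock evaluates to 2.0 × 1.3 × 0.9 × 10 = 23.4 t ha−1.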

Further information in the datasets related to the land-use at the time of sampling and the GIS co-ordinates of the site. This information was complemented with data on variables that we believe may influence SOC dynamics, namely variables describing climate, land-use and site, such as soil type, elevation and geology (Table 1). The data were acquired from GIS data layers from different sources and, where possible, numerous descriptors of the same or similar variables (e.g. maximum, minimum and average MAT, or numerous geological and lithological descriptors) were obtained, so as to reduce potential bias towards a single (set of) variable(s) or a single GIS data source.

Table 1 Predictor variables used to fit tree ensemble models. (C): continuous variable, F (n): categorical variable with n levels

Dataset collation and processing

A procedure was developed to merge the two datasets, which due to their different histories contained inconsistent sampling depth information. The MER project had sampled four depths: 0–5, 5–10, 10–20 and 20–30 cm, whereas within the SCRP project only three depths were sampled (0–10, 10–20, 20–30 cm). To unify the datasets, the MER data from the 0–5 and 5–10 cm were algebraically averaged, to create one depth corresponding to 0–10 cm. The appropriateness of this procedure was tested by comparing the carbon stocks calculated according to Eq. (1) for the 0–10 cm from the averaged values with those calculated as the sum of the C stocks in the individual depths (0–5 and 5–10 cm). The relationship between the two calculations was linear with Pearson’s product moment r > 0.99, so that the algebraic averaging was deemed sufficient for the purposes of the project.
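The comparison behind this check can be sketched as follows. This is a minimal illustration rather than the procedure actually used: it assumes that "algebraic averaging" means the arithmetic mean of each property over the two 5 cm layers, and all names and values are hypothetical.

```python
def carbon_stock(soc_percent, bulk_density, rock_fraction, depth_cm):
    """C stock in t/ha following Eq. (1)."""
    return soc_percent * bulk_density * (1.0 - rock_fraction) * depth_cm


# Hypothetical 0-5 cm and 5-10 cm layers: (%SOC, bulk density, rock fraction).
layer_0_5 = (3.0, 1.2, 0.0)
layer_5_10 = (2.0, 1.4, 0.0)

# Approach A: sum the stocks of the two 5 cm layers.
stock_summed = carbon_stock(*layer_0_5, 5.0) + carbon_stock(*layer_5_10, 5.0)

# Approach B: average each property first (assumed meaning of "algebraic
# averaging"), then apply Eq. (1) once to a single 10 cm layer.
averaged = tuple((a + b) / 2.0 for a, b in zip(layer_0_5, layer_5_10))
stock_averaged = carbon_stock(*averaged, 10.0)
```

With these invented values the summed stock is 32.0 t ha−1 and the averaged stock is 32.5 t ha−1. The two differ because the product of means is not the mean of products, which is why an empirical comparison across the dataset is a sensible check.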

Incomplete profiles were eliminated, as well as obviously erroneous values (e.g. SOC > 100 % or rock content = 100 %). The bulk density data were highly heterogeneous and at times implausible, ranging from 0.2 to 2.9 g cm−3 (which is greater than the density of granite), so that a statistical procedure to process the data was developed. Extreme bulk density values were identified using the interquartile range (IQR). Values above the limit set by the 3rd quartile plus 1.5 times the IQR (corresponding to a bulk density of >1.85 g cm−3 in the top 10 cm and >2.06 g cm−3 in the 20–30 cm depth) as well as values below the 1st quartile minus 3 times the IQR (corresponding to a bulk density of <0.33 g cm−3 in the top 10 cm and <0.46 g cm−3 in the 20–30 cm depth) were omitted. The greater tolerance for extremely low bulk densities was due to the presence of organic soils (Organosols and Hydrosols, WRB Histosols) in the dataset, which would have been excluded using the same tolerance as for the upper limits. A total of 1401 data points remained in the processed dataset for analysis (Fig. 1).
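The asymmetric IQR rule described above can be sketched as follows. The function names and sample values are hypothetical, and the quartile definition (Python's "inclusive" method) is an assumption, since the text does not state which quantile estimator was used. The depth-specific thresholds quoted above suggest the fences were computed separately for each depth; the sketch handles one depth at a time.

```python
from statistics import quantiles


def bulk_density_fences(values, upper_k=1.5, lower_k=3.0):
    """Return (lower, upper) fences: Q1 - lower_k*IQR and Q3 + upper_k*IQR."""
    # Quartile estimator is an assumption; the source does not specify one.
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - lower_k * iqr, q3 + upper_k * iqr


def filter_bulk_density(values, upper_k=1.5, lower_k=3.0):
    """Keep only values lying within the (possibly asymmetric) fences."""
    low, high = bulk_density_fences(values, upper_k, lower_k)
    return [v for v in values if low <= v <= high]


# Hypothetical bulk densities (g cm^-3) for a single depth increment.
sample = [0.2, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 2.9]
kept_asymmetric = filter_bulk_density(sample)              # 1.5 above, 3.0 below
kept_symmetric = filter_bulk_density(sample, 1.5, 1.5)     # same tolerance both sides
```

In this invented sample the implausibly high value (2.9) is removed under both settings, whereas the low value (0.2), which could represent an organic soil, is retained only under the wider lower tolerance.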

SOC storage and vertical distribution

SOC storage was assessed using several variables, namely %SOC and Cstock in each depth increment (0–10, 10–20, 20–30 cm), as well as the sum of %SOC and Cstock over the three depth increments. Although assessed individually, we hereafter refer to %SOC and Cstock combined as SOC content. To assess the vertical distribution of SOC, numerous indicators were investigated, namely:

The proportion of C in the top 10 cm to the sum of C at the three depth intervals:

$$ {C}_{proportion}=\frac{SO{C}_{0-10\kern0.28em cm}}{{\displaystyle \sum } SO{C}_{0-30\kern0.28em cm}} $$
(2)

the ratio of C in the upper depth to lower depth:

$$ {C}_{ratio}=\frac{SO{C}_{0-10\kern0.28em cm}}{SO{C}_{20-30\kern0.28em cm}} $$
(3)

and the difference between C in the upper 10 cm and lower 10 cm normalised to the amount of C in the lower depth:

$$ \Delta {C}_{ratio}=\frac{SO{C}_{0-10\kern0.28em cm}- SO{C}_{20-30\kern0.28em cm}}{SO{C}_{20-30\kern0.28em cm}} $$
(4)

The Cproportion is best related to SOC in the top 10 cm (which contributes most to SOC in the 0–30 cm depth) and implies that the vertical distribution of SOC is strongly dependent on SOC near the surface (i.e. production). In contrast, the Cratio gives a greater importance to the SOC content below the surface, and implies that SOC ratio is more influenced by C retention in the soil. The ΔC ratio is an intermediate between these two measures. These three indices were calculated for both %SOC and Cstocks, and are hereafter collectively referred to as Cgradient.
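The three indices in Eqs. (2)–(4) can be computed together, as in the following sketch. The function name and input values are hypothetical; the inputs may be either %SOC or Cstock for the three depth increments.

```python
def c_gradients(soc_0_10, soc_10_20, soc_20_30):
    """Return (C_proportion, C_ratio, delta_C_ratio) following Eqs. (2)-(4)."""
    total = soc_0_10 + soc_10_20 + soc_20_30
    proportion = soc_0_10 / total                       # Eq. (2)
    ratio = soc_0_10 / soc_20_30                        # Eq. (3)
    delta_ratio = (soc_0_10 - soc_20_30) / soc_20_30    # Eq. (4)
    return proportion, ratio, delta_ratio


# Hypothetical C stocks (t/ha) declining with depth.
declining = c_gradients(30.0, 20.0, 10.0)

# Hypothetical profile with more C at 20-30 cm than at 0-10 cm.
inverted = c_gradients(10.0, 15.0, 20.0)
```

For the declining profile the indices are 0.5, 3.0 and 2.0. For the inverted profile ΔCratio is negative (−0.5), which is the situation in which negative values arise. As written, Eq. (4) equals Eq. (3) minus one.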

Identifying factors important to SOC storage and vertical distribution

To identify the key drivers of SOC storage and vertical distribution, a machine-learning procedure similar to a random forest (Breiman 2001) was employed. An ensemble of 500 regression trees was grown by bagging, and the variable importance of the predictors was extracted to identify the drivers of C storage and vertical distribution. Tree ensembles were grown using the conditional inference forest algorithm (cforest command in the party package, version 1.0-15, Hothorn et al. 2006) with the R language for statistical computing (version 3.1.1, R Core Team 2014). In growing a conditional inference tree, the decision to split a node is based upon the outcome of a test of the global null hypothesis of independence between the response variable and the predictor variables selected for splitting a node. If the null hypothesis can be rejected at the specified significance level (α = 0.05), the node is split. Otherwise, tree growth is terminated. This approach eliminates the possibility of overgrown trees (and therefore the need for tree pruning), and overcomes the bias of traditional random forests towards splitting on categorical variables with many levels, or on continuous variables with a broader scale (Strobl et al. 2007).

Tree ensembles were generated using natural and log-transformed data for both the above-mentioned C variables and the available (continuous) predictor variables (Table 1), resulting in three models (natural, log, mixed) for each response variable. Log-transformation of selected variables was undertaken to both stabilise variance in and linearise the relationship between response and (continuous) predictor variables. Log-transformation of a variable was applied where it led to an improvement in the correlation between response and predictor variable.

Model performance was assessed using explained variance defined by the coefficient of determination (R²):

$$ {R}^2=1-\frac{MSE}{Variance} $$
(5)

with MSE the mean square error of the average of individual estimates for each tree in an ensemble, and Variance the variance of the modelled response variable. As our aim was not to produce predictive models but merely describe the factors influencing SOC storage and vertical distribution, we assessed model performance using the fitted, not predicted values.
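Eq. (5) can be illustrated with a short sketch. The function name and data are hypothetical, and the use of the population variance (dividing by n rather than n − 1) is an assumption, as the text does not specify which was used.

```python
from statistics import fmean, pvariance


def explained_variance(observed, fitted):
    """R^2 = 1 - MSE / Variance, following Eq. (5)."""
    mse = fmean((o - f) ** 2 for o, f in zip(observed, fitted))
    # Population variance is an assumption; the source does not say n or n-1.
    return 1.0 - mse / pvariance(observed)


# Hypothetical observed values and ensemble-fitted values.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.5, 2.0, 3.0, 3.5]
r_squared = explained_variance(observed, fitted)
```

Here the MSE is 0.125 and the variance of the observations is 1.25, giving R² = 0.9. Because fitted rather than out-of-sample predicted values are used, the statistic describes goodness of fit and not predictive skill.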

From these models, the variable importance (VI) of each predictor (determined both from model accuracy and area under curve) was extracted. Where numerous highly collinear (Pearson’s product moment r > 0.9) predictor variables describing similar factors (e.g. MATmax, MATave and MATmin, or EVAP and VPD) were indicated as important in the models, the lower-ranked variables were eliminated and new ensembles grown. This produced final models without highly correlated important variables, as such correlation may affect variable importance results (Nicodemus et al. 2010). From these final models, VI was extracted and averaged over the models to generate the final results.
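The elimination of collinear predictors can be sketched as a greedy pass through the predictors in order of decreasing importance. This is an illustration under stated assumptions and not the procedure actually coded for the study: the function names, data values and importance scores are hypothetical, and the sketch compares every pair of predictors rather than only those describing similar factors.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_collinear(columns, importance, threshold=0.9):
    """Keep a predictor unless |r| > threshold with a higher-ranked kept one."""
    kept = []
    for name in sorted(importance, key=importance.get, reverse=True):
        if all(abs(pearson_r(columns[name], columns[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


# Hypothetical predictor values and importance scores.
columns = {
    "MATmax": [20.0, 22.0, 25.0, 27.0, 30.0],
    "MATave": [14.0, 16.0, 19.0, 21.0, 24.0],   # tracks MATmax exactly
    "MAP": [900.0, 300.0, 700.0, 200.0, 600.0],
}
importance = {"MATmax": 0.30, "MATave": 0.25, "MAP": 0.20}
retained = drop_collinear(columns, importance)
```

In this invented example MATave is dropped because it is almost perfectly correlated with the higher-ranked MATmax, while MAP is retained. The ensembles would then be regrown on the retained predictors.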

Assessing the influence of important factors influencing SOC storage and vertical distribution

To assess the influence of the important factors driving SOC storage and vertical distribution, the results of the tree ensemble data-mining exercise were used to inform classical, parametric-based data analyses. The aim hereby was not to develop predictive models, but to assess and compare the influence of those predictor variables indicated by the tree ensembles as important to SOC storage and vertical distribution.

For each response variable, multiple regressions were created using only those predictor variables whose relative importance, \( VI_{i,rel}=\frac{VI_i}{\sum_{j=1}^{k} VI_j}\times 100\% \), was greater than that expected from a theoretical model where all predictor variables are equally important (i.e. \( VI_{rel}=\frac{1}{k}\times 100\% \) for k predictors in the tree ensembles). The models were created with a mixture of natural and log-transformed data, based upon a correlation analysis of the best relationships between response and predictor variables. Log-transformation was not applied to ∆Cratios due to the presence of negative values in the dataset. The models were built using the variable order indicated by the rank of the relative VI from the tree ensembles. An ANOVA of each regression model was then performed, and non-significant variables (p > 0.05) were dropped to build the final model. From these final models, the relative contribution of climate, site and land-use variables (Table 1) was calculated as the sum of squares attributable to the respective predictors divided by the model sum of squares.
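The selection rule based on relative variable importance can be sketched as follows. The function name and importance values are hypothetical (the predictor name Elev in particular is invented for illustration), and the sketch assumes non-negative VI values.

```python
def important_predictors(vi):
    """Return (relative VI in %, expected % under equal importance, selected).

    vi : dict mapping predictor name to its raw variable importance.
    """
    total = sum(vi.values())
    expected = 100.0 / len(vi)                 # 1/k * 100 % for k predictors
    relative = {name: 100.0 * value / total for name, value in vi.items()}
    ranked = sorted(relative, key=relative.get, reverse=True)
    selected = [name for name in ranked if relative[name] > expected]
    return relative, expected, selected


# Hypothetical raw importances for k = 4 predictors.
raw_vi = {"LUorig": 6.0, "MAP": 3.0, "GSG": 0.5, "Elev": 0.5}
relative_vi, expected_vi, selected = important_predictors(raw_vi)
```

With four predictors the expected relative importance is 25 %, so only LUorig (60 %) and MAP (30 %) would enter the multiple regression, in that order.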

The influence of continuous variables on SOC variables was assessed via the coefficients of each model. For categorical variables, a combination of partial regression, ANOVA and the Games-Howell post-hoc test (an extension of Tukey’s Honestly Significant Differences which adjusts for unbalanced groups and unequal variance) of the relevant variable was performed after controlling for other important variables. For example, to assess the importance of land-use on SOC, partial regressions of important land-use variables were performed whilst controlling for climate and site variables (i.e. the other important variables indicated by the tree ensembles) and the Games-Howell test used to detect significant differences between land-uses. To do this, multiple regression models were created for each SOC variable to its important environmental and site factors and the residuals from these models were then regressed to the important land-use factors and an ANOVA performed. The Games-Howell post-hoc analysis was then undertaken on the results to identify significant differences between land-use types. A comparatively conservative significance level of p < 0.01 was chosen due to the large number of observations in the dataset.
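The idea of controlling for other variables before comparing land-uses can be sketched with a deliberately simplified example. It uses a single continuous covariate and ordinary least squares, whereas the analysis described above used multiple regression on all important climate and site variables. The Games-Howell test itself is not implemented here; only the residualisation and grouping steps are shown. All names and values are hypothetical.

```python
from collections import defaultdict


def residuals_after_covariate(y, x):
    """Residuals of y after a simple least-squares regression on x."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def mean_residual_by_group(residuals, groups):
    """Average the residuals within each categorical group (e.g. land-use)."""
    collected = defaultdict(list)
    for r, g in zip(residuals, groups):
        collected[g].append(r)
    return {g: sum(v) / len(v) for g, v in collected.items()}


# Hypothetical data: %SOC, MAP (mm per year) and land-use at six sites.
soc = [1.5, 2.0, 2.5, 1.0, 1.5, 2.0]
rainfall = [300.0, 500.0, 700.0, 300.0, 500.0, 700.0]
land_use = ["pasture"] * 3 + ["cropping"] * 3

resid = residuals_after_covariate(soc, rainfall)
group_means = mean_residual_by_group(resid, land_use)
```

In this invented example the rainfall trend is removed first, after which pasture sites sit 0.25 %SOC above and cropping sites 0.25 %SOC below the rainfall-adjusted expectation. A post-hoc test would then be applied to such residuals to judge whether the group differences are significant.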

Results

SOC storage and vertical distribution

SOC concentration and stocks were characterised by large variance and non-normal distribution, exhibiting a positive skew. Cstock in the 0–30 cm ranged from 14 to 203 t ha−1 with a coefficient of variation (CV) of >50 % for the entire dataset (Table 2). Relative variance was even larger for %SOC with a CV around 80 %. At the site scale, the average %SOC CV was 24 % for the three depths (range 0.5–155 %), while the average for Cstocks was 29 % (range 0.5–156 %). Generally, SOC concentrations and stocks declined with depth, though not at all sites, as indicated by the negative values observed as minima for ∆Cratios (Table 2). Of the Cgradients, the proportions had the lowest and the ratios the highest relative variance.

Table 2 SOC concentration, stocks and vertical distribution indicators

Tree ensemble models

The SOC content nearest the surface and integrated over all depths were best fit by the models, which explained 76 % of observed variance (Table 3). Explained model variance was poorer for the Cratios at 67 %. Model performance decreased with increasing soil depth.

Table 3 Performance of tree ensembles models

Bulk density and MAP were important explanatory variables for %SOC at all depths, and these variables were top ranked at all depths except 20–30 cm. In the top 10 cm, vapour pressure deficit (VPD) and mean annual relative humidity at 3 pm (MARH3pm) were important, but at 10–20 and 20–30 cm Great Soil Group (GSG) and the land-use recorded during sampling (LUorig) emerged as important splitting variables (Fig. 2). At a depth of 20–30 cm, LUorig was the most important variable in the models. Other factors identified as important in some models were the categories describing climate classes and seasonality (KoppenCode, ClmZone, SrnAll_Code—all depths), as well as the geological/lithological categories PlotSym and sdL (20–30 cm depth).

Fig. 2 Relative variable importance from the regression tree models for SOC concentration and stocks. (a) %SOC at 0–10 cm, (b) %SOC at 20–30 cm, (c) Cstocks at 0–10 cm and (d) Cstocks at 20–30 cm. The blue dotted line represents the expected variable importance in a model where all variables are equally ranked. For an explanation of variables, see Table 1

In contrast, bulk density was not an important splitting variable for Cstocks (Fig. 2). The variable importance rankings for Cstocks revealed LUorig, MAP, and KoppenCode as important variables for the models at all depths (0–10, 10–20, 20–30, and 0–30 cm). LUorig was the most important splitting variable at all depths, with the exception of the top 10 cm, where VPD was most important. Other important splitting variables were relative humidity (MARH3pm, all depths except 20–30 cm), soil type (GSG, all depths, though not in all models of 0–10 cm), seasonal rainfall (SrnAll_Code, all depths except 20–30 cm), geology (PlotSym) and rock content (20–30 cm) and lithology (Lith, some models at 20–30 cm).

Generally, the importance of climate variables and descriptors dominated in the top 10 cm, and with increasing depth land-use and geological descriptors became more important for both %SOC and Cstocks.

In contrast to the differences observable in variable importance rankings between %SOC and Cstocks, all measures of vertical distribution of SOC yielded LUorig as the most important predictor variable, with a normalised importance of up to 10 times greater than expected. Further important variables in all models were temperature, geology, Köppen class and evaporation (MATmax, PlotSym, KoppenCode and EVAP; Fig. 3). In some models, soil type (GSG) and lithology (Lith) were indicated as being of marginal importance (normalised importance less than twice the expected value).

Fig. 3 Relative variable importance from the regression tree models for vertical distribution indicators. (a) %SOC proportion, (b) %SOC ratio, (c) ∆%SOC ratio, (d) Cstock proportion, (e) Cstock ratio and (f) ∆Cstock ratio. The blue line represents the expected variable importance in a model where all variables are equally ranked. For an explanation of variables, see Table 1

Multiple regression analyses

Model performance and factors contributing to SOC storage, stocks and vertical distribution

The multiple regressions best explained the variance in %SOC, and performance declined for the Cstocks and the indicators of C vertical distribution. The amount of variance explained by the models declined with increasing soil depth (Table 4). Between 49 % (Cstock 20–30) and 74 % (%SOC 0–10) of total variance in the SOC variables was explained by the multiple regression models. Of the explained variance in the %SOC models, climate factors contributed the largest amount in the top 0–10 cm, accounting for 51 % of explained variance, with 35 % attributable to site factors and 14 % to land-use. With increasing depth, the importance of climate factors to explained variance decreased and the relative contribution of both site and land-use factors increased, so that at 20–30 cm depth, climate factors contributed 8 % of explained variance in %SOC models, with 61 % attributable to site and 30 % attributable to land-use. The amount of variance attributable to climate, site and land-use for the Cstocks differed, but the general pattern was similar, with climate influence dominating in the top 10 cm, and land-use and site becoming more important with increasing soil depth (Table 4).

Table 4 Multiple regression models

All models of the Cgradients (i.e., the indices representing the vertical distribution of SOC) yielded similar results, with 56–58 % of variance explained by the multiple regressions (Table 4). Land-use was the most important influence on SOC vertical distribution, accounting for 68 ± 4 % of explained variance, with climate accounting for 22 ± 6 % and site factors accounting for 10 ± 4 % (averages ± standard deviation of the individual Cgradient results).

Influences on SOC storage and stocks

Of the continuous variables included in the multiple regressions, MAP and MARH3pm were positively associated with %SOC and Cstocks, whereas VPD was negatively associated with both %SOC and Cstocks. Bulk density was negatively associated with %SOC.

Of the categorical variables, only Köppen climate classes (KoppenCode) and the land-use recorded during sampling (LUorig) are considered here, as they were important in all tree ensemble models and significant in the multiple regressions. Although GSG was significant in many models, the Games-Howell post-hoc analysis failed to identify significant differences between soil groups.

After controlling for other variables, the Games-Howell post-hoc analysis of %SOC and Cstocks to Köppen climate classes indicated generally greater SOC in temperate than in either subtropical, grassland or desert climates (Table 5). Differences were not consistent for all depths, nor when considering both %SOC and Cstocks.

Table 5 Games-Howell post-hoc analysis on the partial regression of Köppen climate classes to C indicator variables, controlled for other explanatory variables in the model

After controlling for other variables, the Games-Howell post-hoc analysis of %SOC and Cstocks to land-use indicated the highest C concentrations and stocks in native systems (unused timber/scrub and native grasses). C concentrations and stocks declined in the order: unused timber/scrub/native grasses > improved pasture ≅ grazing systems (including low-grazed systems, native pastures, set-stocking systems and rotational grazing systems) > softwood plantation ≅ modern farming systems (including rotational crop-pasture systems, minimum till systems and carbon farming systems) > tillage cropping > irrigated cotton ≅ dryland cropping.

Influences on SOC vertical distribution

The two continuous variables indicated as important in all the tree ensemble models and therefore included in the multiple regressions were temperature and evaporation (MATmax and EVAP), and both were negatively associated with the Cgradient. Of the categorical variables, only Köppen climate classes (KoppenCode) and land-use at sampling (LUorig) are considered here, as they were important in all models and significant in the multiple regressions. Although GSG was important in the tree ensemble models for the Cgradients of %SOC, no significant differences between different soil types were found using the Games-Howell post-hoc test.

After controlling for other variables, the Games-Howell post-hoc analysis of %SOC and Cstocks to Köppen climate classes revealed fewer significant differences in Cgradients than in %SOC or Cstocks. More significant differences were found in vertical distribution of Cstocks than in the vertical distribution of %SOC (Table 5).

Similarly, fewer significant differences were found for analyses of land-use in the Cgradients than in %SOC and Cstocks. Smaller Cgradients were found under cropped systems than under most other systems. Cgradients declined in the order: grazing systems (including improved pasture, low-grazed systems, native pastures, set-stocking systems and rotational grazing systems) ≅ modern farming systems (including rotational crop-pasture systems, minimum till systems and carbon farming systems) > cropping.

Discussion

Model performance

For the tree ensembles, the R² values indicate a good explanation of variance between measured and modelled SOC concentration and stocks (e.g. Wiesmeier et al. 2014), indicating their suitability for describing the observable variance in SOC dependent upon the explanatory variables available from sampling and site (GIS) data. The gap between modelled and total variance (~25–30 %) closely resembled the CV at site level (for 10 replicates per site CV averaged ~25 %, data not shown), so that our models appear not only to explain variance adequately but also not to overfit the data. There was no large difference in fitted model performance between the two measures of SOC content: %SOC performed slightly better for the individual depths, but for the combined 0–30 cm depth interval, Cstocks were described slightly better, indicating that the models perform equally well at describing trends in both SOC concentration and stocks.

With the exception of %SOC at 0–10 cm, the multiple regression models generally explained less variance than the tree ensemble models. However, given that they were designed to analyse the influence of specific variables on SOC storage and vertical distribution, and not as predictive models, their lower performance is not of great import.

Factors influencing SOC storage and vertical distribution

The tree ensemble models indicated similarities and differences in the variables important to SOC concentration and storage. Notably, although bulk density was very important to SOC concentrations, it was not an important factor determining C stocks. Relationships between SOC concentration and bulk density are well established and it is often assumed that SOC concentration affects bulk density (Ruehlmann and Körschens 2009). However, in compacted soils water infiltration, gas exchange, plant growth and root penetration are limited, inhibiting SOC production so that bulk density also affects SOC. As such, bulk density can be viewed as an important indicator of soil physical health with a strong influence on SOC. That bulk density was important to SOC concentrations but not stocks, despite its occurrence in the formula used to calculate Cstocks, is likely a result of the inverse relationship between SOC concentrations and bulk density. As SOC concentration increases, bulk density decreases. However, the relationship between SOC concentration and bulk density is not linear, so that the product of these two variables is not clearly delineated, rendering bulk density unimportant to predicting Cstocks. Due to the nature of the highly diverse soils and fixed sampling depths in the original datasets, we calculated SOC stocks without adjusting for the effects of SOC on bulk density, and therefore sampling depth, via the equivalent soil mass (ESM) approach (Ellert and Bettany 1995). It is possible that using ESM to calculate SOC stocks may further illuminate the importance of bulk density to soil physical health and SOC stocks, and this should be the focus of future investigations.

The fact that the variables identified as important in the tree ensemble models differed (absolutely and in rank) between %SOC and Cstocks accounts for the different contributions to explained variance in the multiple regression models of the different factors (climate, site, land-use) for SOC concentration and stocks. A further explanation for the different contributions of the various factors to model variance is the differences in total variance explained by the models (SOC concentrations are better modelled than Cstocks), which leads to a shift in the relative contribution of a given factor to the explained variance. In this dataset, %SOC appears influenced by bulk density, whereas Cstocks, with one exception, are not, with the result that the multiple regression models describe Cstocks less effectively than %SOC, and the influence of site factors (which include bulk density) is greater for %SOC than for Cstocks.
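
The shift in relative contributions can be seen with simple arithmetic. The sketch below uses made-up variance fractions (not results from this study) and a hypothetical helper, `relative_contributions`, to show that a factor explaining the same absolute fraction of total variance takes a larger relative share when the model as a whole explains less.

```python
def relative_contributions(explained):
    """Convert absolute explained-variance fractions into shares of the
    total explained variance. `explained` maps factor group -> fraction."""
    total = sum(explained.values())
    return {name: value / total for name, value in explained.items()}


# Hypothetical values only: the site contribution is smaller in the stock model.
conc_model = {"climate": 0.30, "site": 0.20, "land_use": 0.10}
stock_model = {"climate": 0.30, "site": 0.05, "land_use": 0.10}

conc_shares = relative_contributions(conc_model)
stock_shares = relative_contributions(stock_model)
for name in conc_model:
    print(name, round(conc_shares[name], 2), round(stock_shares[name], 2))
```

Here climate and land-use explain identical absolute fractions in both hypothetical models, yet their relative shares rise in the second model simply because the total explained variance is lower.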

As in other investigations into SOC storage at different depths (Jobbagy and Jackson 2000; Badgery et al. 2013), climate factors dominated as important explanatory variables for both SOC concentrations and stocks near the surface, with precipitation, vapour pressure deficit and relative humidity highly important in the top 10 cm but less important below the surface. Consistent with our understanding of the relationships between rainfall, net primary production (Michaletz et al. 2014) and SOC storage (Jobbagy and Jackson 2000; Wiesmeier et al. 2013b), MAP was positively associated with SOC content, whereas VPD was negatively associated with SOC content. These results suggest that SOC production near the surface is limited by water availability. The effect of water-limitation will be most noticeable in SOC content at the soil surface, as dryness throughout the soil profile will limit not only below-ground SOC production (i.e. root growth and exudates), but also above-ground plant productivity, e.g. leaf-litter production. Therefore, water-stress is likely to affect the SOC content at the soil surface to a greater degree than below the surface. Compounding this is the fact that soil temperatures are greatest near the surface, resulting in higher evaporation and drier soils, which will further limit SOC production in shallow-rooting vegetation types (e.g. pasture).

Despite long-standing links between temperature and ecosystem productivity (Michaletz et al. 2014), temperature was not an important variable in the tree ensemble models of SOC content. However, temperature was highly important to the vertical distribution of SOC. Temperature has been shown to have a stronger association with SOC vertical distribution than precipitation (Wang et al. 2004), and we believe that our results indicate that SOC vertical distribution is driven more by degradation than by production processes. Microbial activity and SOC decomposition rates are positively associated with temperature, but are also limited by substrate availability (Kirschbaum 2006). Although water-availability will limit microbial activity (VPD is an important splitting variable in the tree ensembles), precipitation directly limits SOC production, and will therefore limit microbial activity and SOC degradation indirectly by limiting substrate availability. As the SOC content is generally greatest near the surface, higher temperatures will lead to enhanced turnover of SOC near the surface, where substrate limitation is less likely to affect microbial activity, resulting in a lower gradient of SOC from the surface to subsurface and thereby reducing the Cgradients.
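
One way to make this reasoning concrete is a deliberately simple steady-state toy model, which is not the analysis used in this study. It assumes first-order decomposition with a Q10 temperature response, the same rate constant at both depths, fixed carbon inputs and no substrate limitation; all parameter values and function names are hypothetical.

```python
def decay_rate(temp_c, k_ref=0.05, t_ref=15.0, q10=2.0):
    """First-order decomposition rate (per year) with a Q10 response.
    k_ref, t_ref and q10 are illustrative values, not fitted parameters."""
    return k_ref * q10 ** ((temp_c - t_ref) / 10.0)


def steady_state_gradient(temp_c, input_surface=2.0, input_subsurface=0.5):
    """Surface minus subsurface SOC at steady state, where C = input / k.
    Inputs are hypothetical carbon additions per year at each depth."""
    k = decay_rate(temp_c)
    return input_surface / k - input_subsurface / k


cool = steady_state_gradient(10.0)
warm = steady_state_gradient(20.0)
print(cool > warm)
```

Under these assumptions the warmer site has the smaller difference between surface and subsurface SOC, because faster turnover removes more carbon where more is present.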

With increasing soil depth, the Köppen climate classes (and other categorical climate descriptors) became more important to the models. This indicates that the absolute amounts of rainfall, evaporation and humidity are less important than the climate patterns (i.e. seasonality, temperature regime). This can be explained by reduced water and temperature fluctuations with increasing depth from the soil surface, which dampen the absolute climate signal and result in SOC dynamics below the surface being driven by seasonal trends.
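
The dampening of surface fluctuations with depth can be sketched with the standard analytical solution for periodic heat conduction in a uniform soil, in which the amplitude of a temperature wave decays exponentially with depth. This is a textbook idealisation rather than anything fitted in this study; the thermal diffusivity value, the surface amplitude and the function names below are assumptions.

```python
import math


def damping_depth(diffusivity_m2_s, period_s):
    """Depth (m) at which a periodic temperature wave falls to 1/e
    of its surface amplitude: sqrt(2 * D / omega)."""
    omega = 2.0 * math.pi / period_s
    return math.sqrt(2.0 * diffusivity_m2_s / omega)


def amplitude_at_depth(surface_amplitude, depth_m, diffusivity_m2_s, period_s):
    """Amplitude of the temperature wave at a given depth."""
    d = damping_depth(diffusivity_m2_s, period_s)
    return surface_amplitude * math.exp(-depth_m / d)


DIFFUSIVITY = 5e-7      # m^2/s, assumed value for a mineral soil
DAY = 86400.0           # seconds
YEAR = 365.0 * DAY      # seconds

# Mid-points of the three sampled layers (0-10, 10-20, 20-30 cm).
for depth in (0.05, 0.15, 0.25):
    daily = amplitude_at_depth(10.0, depth, DIFFUSIVITY, DAY)
    annual = amplitude_at_depth(10.0, depth, DIFFUSIVITY, YEAR)
    print(f"{depth:.2f} m: daily {daily:.1f} C, annual {annual:.1f} C")
```

With the assumed diffusivity, most of the daily amplitude is lost by 25 cm whereas most of the annual amplitude remains, which is in line with the interpretation that subsurface SOC responds to seasonal patterns more than to short-term fluctuations.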

With increasing soil depth, the influence of climate variables diminished and site and land-use factors became more important to SOC storage. That site factors such as soil type, lithology and geology become more important with depth can be explained by the fact that the ability of the soil inorganic matrix to retain SOC is linked with mineralogy and texture (Six et al. 2002), which are derived from bedrock and weathering properties and are inherent to the sampling site. Near the surface, SOC content is highest and so fine minerals are most likely to be saturated with SOC, limiting their retention capacity. Below the surface SOC content generally decreases, so that saturation of fine minerals is less likely to be an issue and mineralogy becomes more relevant to SOC storage. This is consistent with the results of Grimm et al. (2008), who found that subsoil (10–50 cm) SOC variance was best explained by soil textural classes as derived from soil mapping units. These results imply that, for the environment and sites represented in this dataset, the amount of SOC produced (i.e. near the surface) at a site is limited predominantly by climate, above all precipitation and water availability, but that C retention (i.e. below the surface) is more closely linked with geological and mineralogical site properties than with climate variables.

Mineralogy and (the closely associated) soil texture have long been recognised as relevant to SOC retention (Christensen 1992), particularly in Australian soils (Hassink 1997). In our study we did not have access to soil textural data or measured mineralogical properties, so can only assess the importance of mineralogy via the (proxy) GIS variables related to geology and lithology. Our results highlight their importance to SOC storage below the surface and we strongly recommend the assessment of textural and mineralogical data in SOC research projects so that the importance of these variables can be better understood in future research.

It is notable that the site factors identified as important in the models were inherent to internal soil properties, i.e. bulk density, soil type, geology and lithology, and not descriptive of external (topographical) site properties such as elevation or TWI. Topography is one of Jenny’s soil forming factors (Jenny 1941) and has been found to be highly important to SOC stocks (Grimm et al. 2008; Davy and Koen 2013). However, those studies assessed considerably smaller areas, characterised by lower climate variance, but highly varied topographies. Minasny et al. (2013) suggest that local terrain attributes (slope, aspect, curvatures) are important at small scales (<100 m) but that position in landscape is more important at larger scales (>100 m). Our results suggest that despite the fact that topography varies greatly across NSW, for the scale of this study, which encompasses a large area around the size of Germany, Poland, the Czech Republic, and Austria combined, topographical features are not important drivers of SOC storage and vertical distribution. Instead, climate, geology and humans have the greatest influence on SOC stocks and vertical distribution.

Traditionally, land-use has been reported to affect SOC content near the surface (0–10 cm), with effects more difficult to detect deeper in the soil (Wilson et al. 2008, 2010; Luo et al. 2010). However, the tree ensemble models all indicated land-use as highly important to C content below the surface and the contribution of land-use to explained variance in the multiple regression models more than doubled from the surface 10 cm to the 20–30 cm depth. Further evidence for the importance of land-use in determining subsurface OC content is the identification of land-use as by far the most important variable to C vertical distribution in the tree ensemble models, and its accounting for around two-thirds of the variance in the multiple regression models of SOC vertical distribution.

That land-use becomes important with depth may reflect stratification of SOC down the soil profile under different land-use systems as a result of differential accumulation or loss of SOC at different depths, for example accumulation of SOC at the surface of unused timber/scrub areas, indicated by their higher SOC content and gradients. For pasture systems, it appears that grazing may reduce the input of SOC at greater soil depths (via removal of plant matter at the surface, thereby lowering translocation), resulting in lower SOC content below the surface and therefore higher vertical distribution indicators. Furthermore, the deposition of animal faeces onto the soil surface will help to enrich the soil surface in organic matter, enhancing the gradient from surface to subsurface. Alternatively, land-use can affect vertical distribution via mixing of the soil, as indicated by the low Cgradient and SOC content in conventionally cropped systems.

Land-use not only affected the vertical distribution of SOC to a depth of 30 cm, but also the absolute SOC storage, with native, unused systems storing significantly larger amounts of C than anthropogenically managed systems. These results confirm the current understanding of the effect of land-use and soil disturbance on soil carbon stores, namely that compared with natural systems, the greatest depletion of SOC occurs in highly disturbed, cropped and tilled systems, with modern farming methods (including rotational crop-pasture systems, minimum till systems and carbon farming systems) and grazed systems not as greatly depleted in SOC.

Finally, the results and discussion related to the vertical distribution of SOC are based on data from the top 30 cm of soil. As outlined in the Introduction, Australian soils are unique and it is likely that the ‘topsoil’ (i.e. active/reactive soil zone) can adequately be assessed in the 0–10 cm depth. Although tillage depth in the State is generally 10 cm, deeper tillage to depths of 20–25 cm is occasionally used for specific purposes (e.g. the placement of ‘deep’ P fertiliser), which may affect the indices used for assessing the factors influencing the vertical distribution of SOC, as will the rooting depth of grasses. Although many Australian soils are shallow, with depths of <1 m (Soil and Landscape Grid of Australia 2014), and exhibit poor profile development, generalisation of these results to entire soil profiles should be treated with caution. Instead, the results and discussion presented here can be considered a starting point to assess the drivers of the depth distribution of SOC in eastern Australia. Specifically, the hypotheses developed based upon these results, namely that climate is important near the surface but that geological and mineralogical characteristics become more important in subsoils, that precipitation is the largest climatic driver of SOC stocks, and that the depth distribution is driven more by temperature than by precipitation, should be tested in future research exploring greater soil depths.

Conclusions

We investigated the drivers of soil organic carbon (SOC) storage at three depths (0–10, 10–20, and 20–30 cm) as well as the gradient of SOC from the surface to subsurface in the soils of New South Wales, Australia using a combination of machine-learning and classical statistics. Our results indicate that the storage of SOC near the surface is driven predominantly by climate, most specifically water availability, indicated by the positive association of SOC with precipitation and relative humidity, but negative association with vapour pressure deficit. With increasing soil depth the influence of climate waned and inherent site factors (bulk density, soil type, geology and lithology) and land-use became more important to SOC content. Below the surface, seasonality and climate regimes appear to be more important than absolute values of precipitation and vapour pressure deficit. This we attribute to the dampening of the absolute climate signal (e.g. temperature fluctuations, wetting–drying cycles) at greater soil depths. Temperature was not important to SOC content, but highly important to SOC vertical distribution, and we hypothesise that this is a result of the temperature dependence of microbial activity and SOC degradation. In warmer climates, microbial activity is enhanced, and is greatest near the soil surface, where SOC content is highest. Thus, warmer temperatures lead to a comparatively greater turnover of surface SOC and a correspondingly lower gradient of SOC from surface to sub-soil than in colder climates. Land-use affects the absolute storage of SOC, with natural systems containing the greatest amounts of carbon and conventionally cropped systems the least. Grazed systems and modern farming systems (e.g. rotational cropping-grazing or carbon farming systems) had intermediate SOC contents between these end-members. 
Importantly, land-use is the most important predictor of the vertical distribution of SOC in the investigated depths, which can be attributed to preferential accumulation at the soil surface (unused timber/scrub systems), reduced input into the sub-soil (grazed systems) or mixing of the soil profile (cropping). Lastly, our study assessed SOC storage and vertical distribution in the top 30 cm of soil. Although this is a good starting point for investigating SOC storage and vertical distribution, future research efforts should be focussed on testing the hypotheses developed here regarding the relative importance of the numerous drivers of SOC at multiple depths as well as the depth distribution of SOC in deeper soil profiles.