1 Introduction

Understanding how the global hydrologic cycle has responded to climate change (natural or anthropogenic) in the past and will likely respond to climate change in the future is imperative to ensuring the efficacy of adaptive planning measures that aim to minimize the adverse socio-economic and environmental impacts of climate change. Increases in the frequency and severity of floods and droughts (e.g., Sheffield and Wood 2008; Huntington 2006), heatwaves (e.g., Schar et al. 2004), and wildfires (e.g., Westerling et al. 2006; Moritz et al. 2012), as well as strains on water and food security (e.g., Lobell et al. 2008), have all been linked to climate change. Without advance warning or sufficient resources to mitigate their effects, even modern societies risk destabilization as a consequence of climate extremes (e.g., Hsiang et al. 2011).

In theory, observation-based global reanalyses such as NASA’s Modern Era Retrospective-analysis for Research and Applications (MERRA; Rienecker et al. 2011; Rienecker et al. 2008) provide a robust means for detecting long-term trends and/or abrupt shifts in the hydrologic cycle, and more importantly, enable attribution to their root mechanisms. In practice, inhomogeneities caused by forecast model biases and/or the observations they assimilate limit their applicability for long-term trend assessment (Thorne and Vose 2010; Dee et al. 2011b). Detecting and correcting for observational biases is complicated by the diversity of the observations (e.g., in source, coverage, and record length). The challenge becomes how to distinguish between real climate shifts and artificial observational “shocks.” From an operational perspective, diagnosing the latter (e.g., sensor-related breakpoints) is critical to the success of variational bias adjustment methods (Dee et al. 2011a).

So-called “climate quality” reanalyses, of which the Twentieth Century Reanalysis (20CR; Compo et al. 2011) is the first, seek to ameliorate the issue of unphysical time-varying biases through the assimilation of only those data streams that are stable over long periods of time. For example, 20CR assimilates only synoptic surface and sea-level pressure observations that span a period of 140 years (1871–2010). If shown to be homogenous, it would provide the first comprehensive (i.e., multivariate, multi-level) and consistent long-term climate record suitable for trend assessment, including assessment of whether the hydrologic cycle is intensifying (Huntington 2006).

In a previous study (Ferguson and Villarini 2012), we suggested that the fivefold increase of 20CR’s assimilated observation counts in the 1940s over the central U.S. caused inhomogeneities during the same period. We also showed that, depending on the season, the complete (140-year) record could be considered homogenous. The purpose of this paper is to provide a comprehensive global follow-up to our previous work using a similar methodology. Specifically, we address the following questions:

  • (1) Is the finding that inhomogeneities in 20CR are linked to underlying observational density unique to the central U.S., or globally-representative? If other artificial (non-climate) inhomogeneities are detected, what is their frequency relative to those that are naturally occurring (as a product of climate variability)?

  • (2) For what fraction of the globe is 20CR homogeneous over the period of record? And, for surface air temperature and precipitation, how does this compare with available global gridded in situ datasets?

  • (3) What is the size distribution of the discontinuities? And how are they distributed in time?

  • and (4) How varied are inhomogeneity characteristics among 20CR’s variable fields?

Our approach is to assess homogeneity on a global grid point basis and summarize results regionally for a large subset of variables. The overarching motivation for this study, which is to use 20CR to identify the key processes, feedback mechanisms, and hydrometeorological variables that drive long-term changes in the hydrologic cycle at regional scales (e.g., Troy et al. 2012), dictates that the homogeneity assessment be conducted at a seasonal time step, but this is not always practical. In our case, we choose to focus on the minimum and maximum months in the seasonal cycle of global homogeneity.

The paper is organized as follows. Section 2 describes the 20CR, comparison datasets, and the full methodology, including the statistical test that we apply. Results for each of the experiments are presented in Sect. 3. Section 4 includes a brief summary and conclusions.

2 Data and Methods

2.1 20CR

The 20CR is a global atmospheric reanalysis spanning the 140-year period from 1871 to 2010 at 2.0° spatial resolution and 6-hourly temporal resolution with 24 atmospheric levels (Compo et al. 2011). It is remarkable not only because it more than doubles the pre-existing reanalysis record length but also because only two types of surface observations are used: six-hourly surface- and sea-level pressure observations from the International Surface Pressure Databank (ISPD v2.2.4) and monthly sea surface temperature (SST) and sea-ice concentration fields from the Hadley Centre Sea Ice and SST dataset (HadISST v1.1; Rayner et al. 2003). The ISPD v2.2.4 contains millions of observations from the International Comprehensive Ocean–Atmosphere DataSet (ICOADS) v2.2 (Worley et al. 2005) as well as newly digitized data from land stations that have never before been used. HadISST v1.1 (described in Sect. 2.2.3 below) incorporates many types of observations, in situ as well as from satellites.

The 20CR employs a deterministic Ensemble Kalman Filter (EKF) based on the ensemble square root filter algorithm of Whitaker and Hamill (2002). Background first guess fields are obtained from a short-term forecast ensemble run in parallel, consisting of 56 9-hour integrations of the April 2008 experimental version of the U.S. National Centers for Environmental Prediction (NCEP) Global Forecast System (GFS; Kanamitsu et al. 1991; Moorthi et al. 2001; Saha et al. 2006)—each initiated using the previous 6-hour analysis. The GFS is coupled with the four-layer NOAH v2.7 land surface model (Ek et al. 2003) and run at a horizontal resolution of T62 (192 × 94 Gaussian longitude/latitude) and 3-hourly time step with 28 hybrid sigma-pressure levels. For each time-iteration of the assimilation (6-hourly) and forecast (3-hourly) systems, the ensemble mean and uncertainty estimate (i.e., ensemble spread) are recorded for all variable fields. ISPD surface pressure and sea level pressure observations are independently quality controlled during the assimilation cycle (i.e., ISPD quality controls were not used) through a multi-step procedure that includes a basic check for meteorological plausibility, comparisons with the first guess ensemble and neighboring observations, and for the station (land) pressure observations only, an adaptive time-varying platform-by-platform bias correction scheme. The scheme corrects for statistically significant differences between the first guess and observational means over 60-day increments of the assimilation, which has the effect of smoothing out any sudden shift in observations over a period of a few months (Compo et al. 2011; their Appendix 2). No bias correction was performed on the marine and tropical cyclone ‘best track’ pressure observations and reports. 
For maximum computational efficiency, 5-year production streams (with 14-month spin-up) were used, with the exception of a single 6-year stream for the period 1946–1951, and the most modern period 2001–2010.

We analyze the monthly or annual time-average of 26 variables in total: four analysis fields, 17 forecast first guess fields, and 5 derived quantities (Table 1). We focus on the ensemble mean fields, except in Appendix 1, where we analyze the every-member (n = 56) data available at two locations: Geneva, Switzerland (46.20°N, 6.15°E) and Rondonia, Brazil (24.0°S, 51.0°W). The ensemble spread fields are analyzed jointly in order to diagnose the nature of the breakpoint (i.e., real or artificial). Note that the ensemble spread is an estimate of the uncertainty in the 6-hourly analyses and the 3-hourly forecast values, not the estimate of uncertainty in the monthly mean values themselves. Also important to note is that the uncertainty estimates do not account for uncertainties in the SST, which are considerable (Rayner et al. 2003; Kennedy et al. 2011a, 2011b).

Table 1 The 26 variable subset of 20CR analyzed in this study and their abbreviated names

The derived quantities were selected on the basis of their relevance to studies of land–atmosphere interaction and the global water and energy cycles (e.g., Ferguson and Wood 2011; Betts 2009; McVicar et al. 2008; Ferguson et al. 2012). They are: total column moisture convergence (C), atmospheric-inferred evapotranspiration (E), 10 m wind speed, convective triggering potential (CTP), low-level humidity index (HI), and lifting condensation level (LCL). To be clear, E is computed by:

$$E = P - C + \frac{dw}{dt}, \tag{1}$$

where P is obtained from the forecast, C is calculated using the analysis surface pressure and multi-level humidity and wind fields, and dw/dt, the change in total column moisture, is calculated by taking the difference between first of the month (e.g., February 1 minus January 1) total column precipitable water vapor/ice (PWV) analysis fields. The CTP is a measure of departure from the moist adiabatic temperature lapse rate from 100 to 300 hPa above ground level (AGL). The HI is defined by the 50–150 hPa AGL dew point depression. The LCL is computed from a parcel originating at 2 m and lifted along a dry adiabat.
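As a minimal sketch of the water-balance arithmetic in Eq. 1, the function below computes E for one month at one grid point. The function name, argument names, and the choice of mm per month as a common unit are assumptions made for illustration; they are not taken from the 20CR processing chain.

```python
def atmospheric_evapotranspiration(p, c, pwv_start, pwv_end):
    """Return E = P - C + dw/dt for a single month (illustrative only).

    p         : forecast precipitation P, in mm per month (assumed unit)
    c         : total column moisture convergence C, in mm per month
    pwv_start : total column water (PWV) on the first of the month, in mm
    pwv_end   : total column water (PWV) on the first of the next month, in mm
    """
    # dw/dt is the first-of-month difference, e.g., February 1 minus January 1.
    dw_dt = pwv_end - pwv_start
    return p - c + dw_dt


# Hypothetical values: 80 mm of precipitation, 15 mm of net moisture
# divergence (C = -15), and a 2 mm gain in column water over the month.
example_e = atmospheric_evapotranspiration(80.0, -15.0, 24.0, 26.0)
```

With these made-up inputs, a month with net divergence and a moistening column requires E to exceed P, which is the behavior Eq. 1 implies.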

2.2 Comparison data

2.2.1 CRU

The Climatic Research Unit (CRU) time series dataset version 3.1 (TS3.1) is a 0.5° gridded record of monthly land surface climate (precipitation, mean temperature, diurnal temperature range, and other secondary variables) for the period 1901–2009, derived entirely from daily surface meteorological observations (Mitchell and Jones 2005; Mitchell et al. 2004; New et al. 2000). For this study, we use the 2 m air temperature and corresponding station count data only. TS3.1 fields are the product of an angular-distance-weighted interpolation of monthly climate anomalies relative to the 1961–1990 mean, subsequently recombined with an equivalent grid of normals for the same baseline period (New et al. 1999). In estimating each grid point, TS3.1 uses the eight nearest station records, regardless of direction, within an empirically derived correlation decay distance (CDD) of 1,200 km for temperature (New et al. 2000). If a grid point lies beyond the CDD of any stations, the grid is ‘relaxed’ to the 1961–1990 mean. We found that an entire year (i.e., 12 consecutive months) or longer is relaxed at some point (generally, earlier) in the record at 45 % of the grids. Major sources of error in the TS3.1 include instrumental measurement error, insufficient station density, and interpolation errors (New et al. 2000).

The temperature database on which CRU TS is based was assembled in the late 1990s; only updates from the Monthly Climatic Data for the World (MCDW), monthly climatological data (CLIMAT), and various Bureau of Meteorology (BOM) reports are routinely incorporated. Unlike in previous versions of the TS (TS2.1; Mitchell and Jones 2005), neither homogeneity assessment nor homogenization of the ingested data streams is performed in the production of TS3.1. Nevertheless, a large number of its data sources, including the Global Historical Climatology Network (GHCN; Lawrimore et al. 2011), are received by CRU in a homogenized state.

2.2.2 GPCC

The Global Precipitation Climatology Centre (GPCC; Becker et al. 2012, Schneider et al. 2013) produces a monthly gauge-based precipitation reanalysis at 0.5° spatial resolution that spans the twentieth century (1901–2010). It is commonly referred to as the Full Data Reanalysis. In this study, we use the latest release, version 6 (December 2011). It is based on the world’s largest and most comprehensive collection of precipitation gauge data. This includes: data from over 190 national weather service networks; daily surface synoptic observations (SYNOP) and monthly CLIMAT messages transmitted via the World Meteorological Organization (WMO) global telecommunication system (GTS); global precipitation data collections from CRU, GHCN, and the Food and Agriculture Organisation (FAO); in addition to numerous other regional datasets. Only stations with ten years of data or more are included. After the gauge data (and/or metadata) are received, they are subjected to rigorous comparative analysis (screening) against different sources of data relevant for the same or neighboring stations, as well as a gridded background anomaly. Once screened and (if necessary) corrected, GPCC applies a modified version of the SPHEREMAP method (Willmott et al. 1985) to spatially interpolate station anomalies to grid anomalies, drawing from the data of 16 nearby stations. In the present version (v6), the normal fields are the product of observations from approximately 67,200 stations. A count of contributing gauges is provided for each estimate of P.

The Full Data Reanalysis was not designed to achieve temporal homogeneity and is therefore not recommended for climate trend analysis. An alternative GPCC analysis, VASClimO, which includes only those stations with data coverage for 90 % (45 years) of its record length, was intended for this purpose. It covers, however, only a fraction (1951–2000) of our period of interest and for that reason it is not used in this study.

2.2.3 HadISST v1.1

The Met Office Hadley Centre’s globally complete monthly sea ice concentration and SST dataset version 1.1 (HadISST v1.1; Rayner et al. 2003) covers the period from 1870 to 2010 at 1.0° spatial resolution. It uses gridded, quality-controlled in situ observations for 1871–1981, merged with night-time bias-adjusted National Oceanic and Atmospheric Administration (NOAA) satellite-borne Advanced Very High Resolution Radiometer (AVHRR) observations from January 1982 onwards. The gridded data for 1871–1941 were bias-adjusted to account for uncertainty in sampling methods following Folland and Parker (1995). A two-stage (global and inter-annual) reduced space optimal interpolation (RSOI; Kaplan et al. 1997) procedure was applied to reconstruct the complete (spatial and temporal) SST fields. Quality-improved (homogenized for variance) gridded data are blended with the reconstructed fields to restore local (~500 km) variance attributes. HadISST v1.1 is of particular relevance to our work because it supplies the boundary conditions for the 20CR (see Sect. 2.1).

2.2.4 HadSLP2

The Met Office Hadley Centre’s globally complete monthly mean sea level pressure (SLP) dataset version 2 (HadSLP2; Allan and Ansell 2006) covers the period from 1850 to 2004 at 5° spatial resolution. It is an RSOI reconstruction (like HadISST v1.1) based on a blending of monthly mean SLP observations from 2,228 land stations with gridded marine SLP observations from ICOADS v2.2. The marine component of the ISPD v2.2.4 used in the 20CR is extracted from ICOADS versions 2.4 and 2.5 for the periods of 1952–2010 and 1871–1951, respectively. Consequently, there is a high degree of overlap in the observational content.

The HadSLP2 terrestrial data are subject to a large number of quality control procedures including temporal and spatial consistency checks and a Kolmogorov-Smirnov (K–S) test (Press et al. 1992) for inhomogeneities in the seasonal mean. In the case of the marine data, the Hadley Centre Marine Data System (MDS) version 2 was applied. MDS includes climatology and near-neighbor spatial consistency checks. Time series quality-cleared by MDS were then subjected to further correction according to procedures described in Ansell et al. (2006).

Along with SLP, the observation counts and uncertainty estimates are also provided. Of course, uncertainty estimates are only calculable for the month and grid point for which data has been assimilated. For HadSLP2’s period of overlap with 20CR (1871–2004), only 2.2 % (n = 60) of grid points offer a continuous record of uncertainty.

HadSLP2 was extended from 2005 to present using NCEP-National Center for Atmospheric Research (NCAR) reanalysis (denoted as “R1”; Kalnay et al. 1996; Kistler et al. 2001). However, this more modern record, named HadSLP2r, is not homogeneous with the earlier time series (see http://www.metoffice.gov.uk/hadobs/hadslp2/). For the above reasons, we use only HadSLP2 and its corresponding observational count record.

2.2.5 COBE SST

The Japan Meteorological Agency’s globally complete monthly sea ice concentration and SST dataset (COBE SST; Ishii et al. 2005) covers the period from 1891 to present at 1.0° spatial resolution. It uses quality-controlled SSTs from ICOADS v2.0, the Japanese Kobe Collection, the Canadian Marine Environmental Data Service (MEDS) buoy dataset, as well as ship reports. As in HadISST v1.1, biases in bucket observations before 1941 are removed using the method of Folland and Parker (1995). The objective analyses are based on optimum interpolation and a monthly reconstruction with empirical orthogonal functions intended to homogenize the data.

2.2.6 ERSST v3b

The NOAA Extended Reconstruction Sea Surface Temperature version 3b (ERSST v3b; Smith and Reynolds 2003) is a globally complete monthly sea ice concentration and SST dataset covering the period from 1854 to present at 2° spatial resolution. It is based upon statistical interpolation of quality-controlled ICOADS v2.4 data and does not include satellite data due to a cold bias in the satellite-derived SSTs that proved difficult to correct. The spatial variance ratio of the SST is measurably less than that of HadISST v1.1 because filtering of modes is applied to reduce small-scale noise.

2.3 Methods

In this study, we rely exclusively on the results of the Bai-Perron structural change point test (Bai and Perron 2003). This test is well suited to our purpose for two reasons: it is objective and it has the capability to detect multiple breakpoints. Multiple breaks are the norm rather than the exception in 20CR’s 140-year record (Ferguson and Villarini 2012). Seventy-three percent of all grid locations have more than one break in annual mean T a (not shown). The test must be objective because it needs to be automatable for bulk application on a global grid basis (n = 16,200 @ 2.0°). Finally, we found the Bai-Perron test to be of comparable skill to the Pettitt test (Pettitt 1979) in zero and single break cases; the Pettitt test corroborates 90 % of homogenous series and 73 % of single break dates (not shown).

2.3.1 Bai-Perron test

The Bai-Perron test (Bai and Perron 2003) enables the simultaneous estimation of multiple change points of unknown timing. The data are assumed to come from a distribution belonging to the exponential family (e.g., Gaussian, exponential, Poisson). The test represents an extension of F statistical tests against a single-shift (e.g., Andrews 1993) for multiple break applications. It is based on a standard linear regression model for which the null hypothesis of structural stability is tested against the alternative that at least one coefficient varies with time. We use a constant as the regressor for our model. The minimum permissible segment length (i.e., trimming parameter) is set by the user. We chose to set this parameter, h, to 0.15 (default value in the package we used), which equates to allowing up to five breaks in the 140-year (1871–2010) 20CR. Note this parameter also dictates the earliest and latest possible break date. First, the number of breaks is selected using BIC (Bayesian Information Criterion; Schwarz 1978). Then, dating of the change points is accomplished via a dynamic programming approach that minimizes the objective function (RSS) (Bai and Perron 2003). The statistical confidence intervals corresponding to each change point are computed using the distribution function of Bai (1997), although in a limited number of cases (<1.5 % in this study), errors preclude their estimation (i.e., singular gradient). When calculable, we include the 95 % confidence interval. Modifying the confidence level affects only the length of the confidence interval(s), not the number of change points detected. The Bai-Perron test is non-optimal for cases in which the record length is short, the break sizes are small, and/or the breaks are clustered (Bai and Perron 2003). In addition to abrupt shifts in the mean, changes in series variability or the presence of gradual trends can also lead to breakpoint detection.
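The procedure described above (a constant-only regression, a trimming parameter h, selection of the number of breaks by an information criterion, and dating of the breaks by dynamic programming on the RSS) can be sketched as follows. This is a simplified stand-in written for illustration, not the strucchange implementation we used: the function names are hypothetical, the BIC-style penalty of two parameters per segment is an assumed form, and no confidence intervals are computed.

```python
import math


def prefix_rss(y):
    """Return rss(i, j): residual sum of squares of y[i:j] about its own mean."""
    s1, s2 = [0.0], [0.0]
    for v in y:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def rss(i, j):
        total = s1[j] - s1[i]
        return max(s2[j] - s2[i] - total * total / (j - i), 0.0)

    return rss


def find_mean_breaks(y, h=0.15):
    """Estimate mean-shift breaks; each break is the index of the first
    value belonging to the new segment."""
    n = len(y)
    # Trimming: shortest permissible segment (21 values when n = 140).
    min_len = max(math.floor(h * n + 1e-9), 2)
    # Largest number of breaks that still fits (5 when n = 140, h = 0.15).
    max_breaks = max(n // min_len - 1, 0)
    rss = prefix_rss(y)
    inf = float("inf")
    # cost[m][j]: smallest total RSS for y[:j] split into m + 1 segments.
    cost = [[inf] * (n + 1) for _ in range(max_breaks + 1)]
    prev = [[0] * (n + 1) for _ in range(max_breaks + 1)]
    for j in range(min_len, n + 1):
        cost[0][j] = rss(0, j)
    for m in range(1, max_breaks + 1):
        for j in range((m + 1) * min_len, n + 1):
            for i in range(m * min_len, j - min_len + 1):
                c = cost[m - 1][i] + rss(i, j)
                if c < cost[m][j]:
                    cost[m][j], prev[m][j] = c, i
    # Select the number of breaks with a BIC-style penalty (assumed form).
    best_m, best_bic = 0, inf
    for m in range(max_breaks + 1):
        if cost[m][n] == inf:
            continue
        bic = n * math.log(max(cost[m][n], 1e-12) / n) + 2 * (m + 1) * math.log(n)
        if bic < best_bic:
            best_m, best_bic = m, bic
    # Backtrack through the dynamic-programming table to date the breaks.
    breaks, j = [], n
    for m in range(best_m, 0, -1):
        j = prev[m][j]
        breaks.append(j)
    return sorted(breaks)
```

For a 140-value annual series and h = 0.15, the sketch enforces a minimum segment of 21 values and therefore permits at most five breaks, and no break can be dated within the first or last 21 values, consistent with the constraints noted above.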

Our results were obtained using R 2.14.2 (R Development Core Team 2008) with the packages strucchange v1.4-6 (Zeileis and Kleiber 2005; Zeileis et al. 2003), sandwich v2.2-9, and zoo v1.7-7.

2.3.2 Experimental design

We apply the Bai-Perron test on a 2° grid-by-grid basis to the ensemble mean and uncertainty estimate fields of 26 variables (see Sect. 2.1, Table 1). We summarize the results for the area of 60°S–60°N as well as 27 constituent climatic regions of which 22 are over land and the rest are over ocean (Fig. 1; Table 2). Whether a time series is inhomogenous is case dependent. From a statistical perspective a time series may be considered inhomogenous if it has any breakpoints while in a climate sense inhomogenous means affected by changes that are not of climate origin. In this study, we will focus on inhomogeneities in the second sense, although information on climate-related breakpoints will be presented in figures.

Fig. 1

Regions over which the analysis was conducted (n = 27; see Table 2). The delineation over land is based on that of Giorgi and Francisco (2000), but modified to better reflect the land–atmosphere coupling and climatological wetness regimes shown by Ferguson et al. (2012). Ocean domains are defined according to standard Japan Meteorological Agency conventions

Table 2 Description of the regions used in this study and the number, n, of 2° grid cells they comprise

We focus on T a and P because of their prominent role in global climate, but also because (along with surface pressure) they are the most widely (and accurately) monitored meteorological quantities. Due to high confidence in their measurements (especially T a), they commonly serve as benchmarks in model performance. One of our objectives is to inform users of their homogeneity characteristics so that the fields are not applied inappropriately in some form of climate model evaluation.

We focus on 20CR’s mean fields (i.e., official 20CR product) because they are the most widely applied. However, we acknowledge that homogeneity will vary among 20CR’s 56 ensemble members and between ensemble members and the ensemble mean. In Appendix 1, we present results from our every-member analyses of T a and P for Geneva, Switzerland and Rondonia, Brazil. We found that coincident discontinuities in as few as five ensemble members could lead to a detectable shift in the ensemble mean.

We define a non-climate (i.e., unphysical or artificial) break as a breakpoint in the time series of the mean variable field whose 95 % confidence interval overlaps (for any number of years) with the 95 % confidence interval of a breakpoint in the time series of the corresponding ensemble spread (e.g., P and P spread). Substitute variable spread fields are used for derived variables that have no associated spread field. The meridional wind (vgrd) ensemble spread is used to detect non-climate breaks in 10 m windspeed (WSPD); the convective available potential energy (CAPE) ensemble spread is used to detect non-climate breaks in CTP; the 2 m minimum air temperature (TMIN) ensemble spread is used to detect non-climate breaks in C, E, HI, and LCL. Our definition of non-climate breaks may be sensitive to the size of confidence intervals in cases where the confidence intervals for both fields are relatively wide. In the case of monthly T a, confidence intervals range from 3 to 84 years in length, with a median length of 18 years. For monthly spread in T a the range is similar, although the median length is substantially shorter (11 years).
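The overlap rule above can be expressed compactly. The sketch below is illustrative only: the function names and the representation of each 95 % confidence interval as a (first year, last year) pair are assumptions, as is treating the interval endpoints as inclusive.

```python
def intervals_overlap(a, b):
    """True if the closed year intervals a and b share at least one year."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_breaks(mean_cis, spread_cis):
    """Label each break in the mean field as 'non-climate' when its
    confidence interval overlaps that of any break in the spread field,
    and as 'climate' otherwise."""
    labels = []
    for ci in mean_cis:
        artificial = any(intervals_overlap(ci, s) for s in spread_cis)
        labels.append("non-climate" if artificial else "climate")
    return labels


# Hypothetical intervals: two breaks in the mean field, one in the spread.
example_labels = classify_breaks([(1938, 1952), (1975, 1981)], [(1944, 1949)])
```

Because any overlap is sufficient, wide intervals in both fields make a non-climate label more likely, which is the sensitivity noted above.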

The underlying rationale for our non-climate definition derives from the fact that the ensemble spread typically varies as an inverse function of the assimilated observation count (Ferguson and Villarini 2012; their Fig. 1). If break dates are coincident between variable mean and spread fields, then the logic follows that observational network changes are likely the source of the discontinuities. However, over ocean we found that the inverse relationship between T a spread and assimilated observation count is not always upheld (see Figs. 5b, h, i, 16). While one plausible explanation is that the ensemble is tightly constrained to a time invariant constant by the specified HadISST v1.1 field, it does not explain how there can still be variability in the TMIN spread (Fig. 16). Because its time series appears more realistic, we use TMIN spread in place of T a spread.

3 Results

3.1 Homogenous fraction and seasonality

Since we first reported evidence of observational shocks in 20CR’s record over the central U.S. (Ferguson and Villarini 2012), an open question has been: how pervasive are such effects globally and how do they vary seasonally? In Fig. 2, we present global maps of non-climate breakpoint counts for each monthly time series of T a. Consistent with earlier work, we find a sizeable seasonal component to 20CR’s inhomogeneities (and their detectability), especially in the northern extratropics (Fig. 3). For the period 1871–2010, the global fraction of statistically homogenous grids (grids with natural breaks-only) can be seen to range from 10 % in July and August (28 % in May) to 21 % in February (39 % in January; Fig. 3). Both climate and non-climate inhomogeneities in 20CR’s other variable fields track the same general seasonality (i.e., the homogenous fraction peaks during northern hemisphere winter and dips during the northern hemisphere summer; not shown). Because February and July typically constitute the months of maximum and minimum homogeneity, respectively, we chose to focus the remainder of our analysis on them. Relatively greater homogeneity observed in the northern hemisphere (Figs. 2, 3, and S1) might have been anticipated from the hemispheric local anomaly correlation results presented in Compo et al. (2006; their Figs. 7 and 10).
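For readers wishing to reproduce coverage fractions of the kind summarized in Fig. 3, a minimal sketch follows. It assumes each grid cell carries a list of break labels ("climate" or "non-climate") and counts cells without area weighting; both the data layout and the unweighted counting are assumptions made for illustration.

```python
def coverage_fractions(cell_labels):
    """Return the fractions of grid cells that are (1) free of breaks,
    (2) affected by climate breaks only, and (3) affected by at least
    one non-climate break."""
    n = len(cell_labels)
    homogeneous = sum(1 for labels in cell_labels if not labels)
    climate_only = sum(
        1 for labels in cell_labels
        if labels and all(label == "climate" for label in labels)
    )
    non_climate = n - homogeneous - climate_only
    return homogeneous / n, climate_only / n, non_climate / n


# Hypothetical four-cell domain.
example_fractions = coverage_fractions(
    [[], ["climate"], ["climate", "non-climate"], ["non-climate"]]
)
```

In this sketch a cell with both climate and non-climate breaks is counted once, in the non-climate category, so the three fractions sum to one.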

Fig. 2

Global monthly non-climate breakpoint count in 20CR T a. For the total breakpoint counts, including both physical and non-physical breaks, see Fig. S1

Fig. 3

For the globe, 60°S–60°N, and northern and southern extratropics, the normalized fractional coverage in 20CR T a that is unaffected by inhomogeneities (i.e., statistically homogenous; solid line) or affected only by changes of climate origin (line with filled circular marker). Note that the difference of the sum of these terms from unity is approximately (because, over the 140-year record, both climate and non-climate breaks can occur at a single point) equal to the fractional area contaminated with non-climate changes

In Fig. 4, the February and July inter-variable and inter-regional differences in the areal extent of non-climate (unphysical) changes are summarized for a 26 variable subset (Table 1) of 20CR over 60°S–60°N in addition to 27 smaller regions (see Table 2). It shows that, on average, 46–59 % (February-July) of grids between 60°S and 60°N are affected by non-climate changes, which is less than that for T a (February: 0.59; July: 0.68) but more than that for P (February: 0.33; July: 0.53). The upward longwave radiation flux (LW) and 10 m wind speed (WSPD) are the least and most contaminated with artificial shocks, respectively. Their 60°S–60°N affected coverage ranges from 16 % (LW) and 72 % (WSPD) in February to 22 % (LW) and 79 % (WSPD) in July. Overall, Fig. 4b, c can serve as a valuable reference for users looking to isolate regions where long-term trend assessment is currently feasible (or not). Conversely, Fig. S2, which shows that the areal extent of climate-related changes is typically between 20 and 23 % of grid points, is valuable for further analysis of climate variability.

Fig. 4

Bai-Perron test results for February and July time series of 26 20CR variables (see Table 1 for variable name definitions). a For 60°S–60°N, the fraction of grid cells (land and ocean combined, except for Q and G, which are defined over land only) that are affected by non-climate breaks at some point over the period of availability (1871–2010). b and c same as in (a) but on a regional basis (see Fig. 1, Table 2). In (a), the multivariate median values for February (gray) and July (black) are marked by horizontal lines. Results for T a (red), P (blue), and the multivariate median (black) are highlighted in (b) and (c). See Fig. S2 for the complementary figure (i.e., the fraction of grid cells affected by climate-related breaks)

According to the multivariate median, Northern Europe (NEU) is the least affected by non-climate breaks, both in February (<1 %) and July (31 %). In February, northwestern Canada (NWC), northeastern Canada (NEC), western U.S. (WUS), central U.S. (CUS), eastern U.S. (EUS), and Mediterranean (MED) are also relatively unaffected. Eastern Africa (EAF; 0.78) and Sahara (SAH; 0.82) are the most affected domains in February and July, respectively. February-July differences in the multivariate median affected area fraction (Fig. 4b, c, bold black line) average 21 %, but range between less than 1 % (Amazon: AMZ; Indian Ocean: IO; and southern Africa: SAF) and as much as 52 % (EUS). The February-July difference is less than 10 % for the following regions: EBR, southern South America (SSA), Congo (CON), EAF, SAF, southeast Asia (SEA), and IO.

In general, these findings hold qualitatively for climate-related changes as well (see Fig. S2). The areal extent of climate-related breakpoints is highest in AMZ, EBR, and Africa (except SAF). Remarkably little area (2–4 %) in Australia (AUS) is affected by changes of climate origin (Fig. S2b, c).

3.2 Breakpoint size distribution and detectability

Considering that 20CR is statistically inhomogenous at the majority of grids (Fig. S1), a key question is: what is the typical jump size associated with these breaks? In Fig. 5, we provide a sampling of eight inhomogenous grid records each for T a and P from around the world. They are representative of the array of detectable inhomogeneity, ranging from instantaneous (e.g., Fig. 5a) to gradual (e.g., Fig. 5c), and with varying jump sizes. For added reference, the spread time series and breakpoint record in comparison datasets, CRU T a and GPCC P, are included, although no breakpoints were detected in GPCC P. Breakpoint summaries for the full 26-variable subset of 20CR (Table 1) at these same grid points are provided in Fig. S3. While the breaks in T a and P highlighted in Fig. 5 do pervade through multiple (if not most) modeled variables, Fig. S3 shows this is not the rule.

Fig. 5

For selected grid points, the time series of 20CR a–i T a and j–r P (in black) and their respective ensemble spread (in blue). Vertical dashed red lines denote detected breaks in the variable mean fields (confidence intervals not shown). CRU T a and GPCC P records, available over land points (c–e, g, and m–q), were also evaluated for breaks. Breaks detected in the CRU T a are denoted by vertical cyan lines (confidence intervals not shown); no breaks were detected in GPCC P over the selected grid points. The nearest major city is noted, within reason

In several instances the abrupt shifts in variable mean correspond with those in the variable ensemble spread (Figs. 5 and S3). Such coincidences are strongly suggestive of unphysical inhomogeneities (Ferguson and Villarini 2012). The fact that only one breakpoint in one location is corroborated by a break (95 % confidence interval; not shown) in CRU T a (Fig. 5d) further supports this conclusion. Finally, the ocean grids for which the T a spread is time invariant (Fig. 5b, h, i) are examples of why TMIN spread is used instead for diagnosing non-climate breakpoints (see Sect. 2.3.2).

In Fig. 6, the full distribution of detected breakpoints (n = 853,376) for 26 variables and 2 months (February and July) is summarized according to jump size. The jump sizes are normalized by the standard deviation of the preceding series segment to enable inter-variable comparison (Fig. 6a). The variable frequency polygons are generally in close agreement with regard to mean, spread, and skew. There is remarkably strong consensus that the distribution is positively skewed: 72 % of breaks exceed one standard deviation in magnitude; 50 % of breaks exceed 1.3 standard deviations in magnitude; and 25 % of breaks exceed 1.8 standard deviations of the preceding time series. Figure 6b, c shows the absolute jump size distributions for T a and P, respectively. One-half of breaks are shown to exceed 0.7 °C and 11.9 mm month−1, respectively.
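The normalization described above can be illustrated with a minimal sketch. It assumes that the jump at a break is the difference between the means of the segments following and preceding it, which is our reading rather than a definition stated in the text; the function name jump_size and its index arguments are hypothetical.

```python
from statistics import mean, stdev


def jump_size(series, start, brk, end):
    """Return (absolute_jump, normalized_jump) for one break.

    series : list of floats, e.g. one value per year for a given month
    start  : index where the preceding segment begins
    brk    : index of the first value after the break
    end    : index one past the last value of the following segment

    Assumption (not from the paper): the jump is the difference
    between the following-segment mean and the preceding-segment mean.
    """
    before = series[start:brk]
    after = series[brk:end]
    if len(before) < 2 or len(after) < 1:
        raise ValueError("need >= 2 values before and >= 1 value after the break")

    absolute = mean(after) - mean(before)

    # Normalize by the sample standard deviation of the preceding segment
    sd = stdev(before)
    if sd == 0:
        raise ValueError("preceding segment has zero variance")
    return absolute, absolute / sd
```

Expressing jumps in units of the preceding segment's standard deviation is what allows variables with different physical units (e.g., °C and mm month−1) to share one frequency polygon.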

Fig. 6

For the globe (90°S–90°N), the normalized frequency polygons of breakpoint jump size in (a) units of standard deviations, and for b T a and c P, as an absolute quantity. The gray bars in (a) characterize the set of all detected breakpoints (i.e., physical and non-climate breaks for February and July) in 26 variable mean fields (n = 853,376). Each separate variable polygon (green, blue, and red lines) is normalized by its own respective count total; the global ocean and global land polygons are normalized by the total global count. Annotated percentiles correspond with the multivariate global (gray bars) polygon only. Polygons have a bin size of 0.2 in (a, b) and 5.0 in (c). The polygon for surface pressure is the most peaked with 47 % of breaks less than one standard deviation in size

In many applications, especially in an operational setting, knowing the detectability limits is desirable. For the Bai-Perron test, we recommend assuming a detectability limit of 0.7 standard deviations computed from the time series prior to the change point, equivalent to the fifth percentile of multivariate detected jump size (0.3 °C and 1.4 mm month−1 for T a and P), globally (Fig. 6). It could be the case that test sensitivity exceeds that which is required or meaningful for the application at hand. For example, Figs. 7 and 8 illustrate the monthly global distribution of minimum detected jump sizes in T a and P, respectively. The smallest shifts over the Tropics and coastal areas for T a and deserts of Africa for P might be inconsequential. Notably, we found no substantial difference in the mean jump sizes of non-climate breaks related to observational network changes (next section) and natural breaks related to climate variability.

Fig. 7

For 20CR T a, the global monthly distribution of minimum detected jump sizes (| °C|)

Fig. 8

As in Fig. 7, for 20CR P (|mm month−1|)

3.3 Non-climate breakpoints

Non-climate (unphysical) inhomogeneities are diagnosed using the joint confidence intervals of breaks detected in the variable and spread fields (Sect. 2.3.2). Their fraction of the total February and July breakpoint counts is summarized in Fig. 9. As before, a 26-variable subset of 20CR is considered over 60°S–60°N (Fig. 9a) and 27 constituent climatic regions (Fig. 9b, c). The multivariate mean non-climate fraction for 60°S–60°N is found to be approximately 0.72 (for both February and July), which slightly exceeds that of T a (February: 0.70; July: 0.64) but not P (February: 0.80; July: 0.82). Non-climate breaks constitute the smallest proportion (0.20) of breaks in surface upward longwave radiation (LW↑). Regionally, the largest non-climate fractions (0.95) are reported for CAS (February) and Australia (AUS; February and July). The smallest non-climate fraction (0.60) is found for the tropical Pacific Ocean (TPO; February) and the North Atlantic Ocean (NAO; July). Neither the 60°S–60°N results nor the domain results exhibit seasonality in their multivariate median non-climate fraction (Fig. 9b, c; except in the case of CUS and EUS, for which the February breakpoint population size was insufficient).
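As an illustration of how a non-climate fraction of this kind might be computed, the sketch below flags a break in the variable mean as non-climate when its confidence interval overlaps that of any break in the spread field. The overlap rule is a simplifying assumption standing in for the joint-confidence-interval criterion of Sect. 2.3.2, and the function names are hypothetical.

```python
def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share any year."""
    return a[0] <= b[1] and b[0] <= a[1]


def non_climate_fraction(mean_break_cis, spread_break_cis):
    """Fraction of mean-field breaks whose CI overlaps a spread-field break CI.

    mean_break_cis   : list of (first_year, last_year) confidence intervals
                       for breaks detected in the variable mean
    spread_break_cis : same, for breaks detected in the ensemble spread

    Returns None when no breaks were detected in the mean field.
    Assumption: any overlap counts as a joint (non-climate) detection.
    """
    if not mean_break_cis:
        return None
    flagged = sum(
        1
        for ci in mean_break_cis
        if any(intervals_overlap(ci, s) for s in spread_break_cis)
    )
    return flagged / len(mean_break_cis)
```

In practice, the mean and spread intervals would be pooled over all grids of a region before forming the ratio, one variable and one month at a time.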

Fig. 9

Fraction of all detected breaks in the February and July time series that can be attributed to non-climate (i.e., observational network) sources. a For 60°S–60°N (land and ocean combined, except for Q and G, which are defined over land only), and for b, c each of 27 land- and ocean-only climatic regions. The multivariate regional medians [bolded black line in (b) and (c)] are computed from the set of all single-variable values with underlying sample sizes of 50 or more non-climate breaks in the specific region (denoted by a circle)

Figure 10 details the spatio-temporal distribution of non-climate breaks. For 60°S–60°N, the inter-variable range in median non-climate break dates is 1924–1959 for February and 1936–1950 for July; the multivariate mean of median non-climate break dates is 1947 and 1944 for February and July, respectively (Fig. 10a). Ninety percent of all non-climate breaks between 60°S and 60°N (both February and July) are detected prior to 1979, when modern satellite-era atmospheric reanalyses such as MERRA and NCEP’s Climate Forecast System Reanalysis (CFSR; Saha et al. 2010) begin. Figure 10b gives the February and July multivariate median: 10th percentile, mean and 90th percentile non-climate break date for each of the 27 climatic regions considered. The means of these values, taken over all domains, are: 1904, 1934/1939 (February/July), and 1967, respectively. In Fig. 10c, the area-normalized multivariate median non-climate breakpoint count is plotted. By multiplying this value by the number of 2° grids in the domain (see Table 2), the actual median count can be computed. Because the 60°S–60°N non-climate fraction is insensitive to seasonality (Fig. 9a), we expect the non-climate break count to scale linearly with the count of all breaks, which it does. On average, 0.21 domain areas more non-climate breaks are detected in July (mean = 0.82) as compared to February (mean = 0.61) (Fig. 10c).

Fig. 10

Boxplot summary of temporal patterns in non-climate breakpoints for a 60°S–60°N (land and ocean combined, except for Q and G, which are defined over land only) and b for each of 27 land- and ocean-only regions. c The area-normalized (i.e., by the contributing grid area, n, provided in Table 2) multivariate median non-climate breakpoint count. In (a) the boxplot whiskers bracket the 10th and 90th percentiles; the circled dot denotes the median. In (b), the multivariate regional median (cyan, orange) and 10th and 90th percentiles (gray, black) are all medians of the set of like (i.e., median, Q10, and Q90) single-variable statistics supported by underlying sample sizes of 50 or more non-climate breaks in the respective region. Thus, each inclusive variable is given equal weight

The box plots and bulk statistics of Figs. 9 and 10 can only go so far towards isolating the inaccuracies of 20CR. Using Fig. 11, it is possible to visually pin down time windows for each region that deserve greater scrutiny. It illustrates the full 140-year time series of physical and non-climate breaks in T a for February or July, whichever is less homogenous. Red shading indicates the times when breaks are mostly of non-climate origin. Hollow black bars, on the other hand, denote times when natural breaks dominate. A great example of the robustness of our approach is the 1976–1977 climate shift of the North Pacific basin (e.g., Meehl et al. 2009; Powell and Xu 2011), which is properly diagnosed as real (Fig. 11aa).

Fig. 11

For (a) global land (excluding Greenland and Antarctica) and (b–bb) each of the 27 study regions (Table 2), the (hollow black bars) grid count within the 95 % confidence interval (CI) of a breakpoint in T a, the (red shaded area) grid count within the 95 % CIs of breakpoints in both T a and TMIN ensemble spread, and in (x–bb), the (secondary yellow y-axis) grid count within the triple 95 % CI of the following three SST datasets: HadISST v1.1, ERSST v3b, and COBE SST. In (a–w), the number of grids within the 95 % CI of a physical break in CRU T a is plotted on the secondary (blue) y-axis

The results for CUS contrast with, but do not completely contradict, our previous assertion that the mid-1940s breakpoint is unphysical (Ferguson and Villarini 2012). In this study, a total of 21 breaks are detected in the 1940s (of which all occur in 1949) and only three are diagnosed as non-climate (Fig. S4e). Accordingly, the mid-1940s break appears to have competing observational network and climate explanations. In more general terms, this case highlights sensitivity to the choice of statistical test. Recall that previously we applied a hands-on segmented test to the ensemble spread field, whereas we are presently applying an automated Bai-Perron test.

It is important to point out that the ratio of non-climate breaks to total breaks in Fig. 11, as well as the absolute count, is inconsistent with previous results (Fig. 10). That is because a different accounting convention was applied. In Fig. 11, breakpoints contribute to the tally in every year of their 95 % confidence interval. For example, if the year lies within the joint confidence interval of breakpoints in both the T a mean and TMIN spread, then the non-climate breakpoint count is increased by one. Alternatively, if the year lies within the confidence interval of a breakpoint in T a mean, but not for TMIN spread, then the physical breakpoint count is increased by one. While it is true that the breakpoint confidence intervals can be very wide (see Sect. 2.3.2), accounting for their lengths is the only way to represent the true uncertainties inherent to the breakpoint detection. The convention of this study has been, and remains, that of constraining all accounting to the year of the central break date (see Fig. S4). The merit of Fig. 11 is that it identifies the precise era of overlap between the confidence intervals of the mean and ensemble spread fields (see Fig. S5 for further details).
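The per-year accounting convention of Fig. 11 can be sketched as follows. This is a minimal illustration under our reading of the rule stated above: each grid contributes at most one count per year, to the non-climate tally if the year falls within both a T a mean and a TMIN spread confidence interval, and to the physical tally if it falls within a T a mean interval only. The function name and data layout are hypothetical.

```python
def tally_by_year(grids, years):
    """Count grids per year that lie within breakpoint confidence intervals.

    grids : list of (mean_cis, spread_cis) pairs, one pair per grid, where
            each element is a list of (first_year, last_year) intervals
    years : iterable of years to tally (e.g., range(1871, 2011))

    Returns {year: {"physical": int, "non_climate": int}}.
    """
    years = list(years)
    counts = {y: {"physical": 0, "non_climate": 0} for y in years}

    for mean_cis, spread_cis in grids:
        for y in years:
            in_mean = any(lo <= y <= hi for lo, hi in mean_cis)
            if not in_mean:
                continue  # no break in the variable mean near this year
            in_spread = any(lo <= y <= hi for lo, hi in spread_cis)
            # Joint coverage -> non-climate; mean-only coverage -> physical
            key = "non_climate" if in_spread else "physical"
            counts[y][key] += 1
    return counts
```

Because a single break with a wide confidence interval is counted in many years, totals from this tally are not comparable with counts assigned only to the central break date.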

Break test results from comparison datasets are also included in Fig. 11. Over land, physical breakpoint count results from CRU T a are plotted. Because there is no equivalent spread field for CRU, breaks in its CDD contributing station time series are used to diagnose non-climate effects. Over ocean, counts of triple coincidence among breakpoint confidence intervals in HadISST v1.1, ERSST, and COBE SST, are plotted as best estimate physical breakpoint counts. Assuming 20CR is skillful, we would expect its physical breaks (hollow black bars) to correspond closely with those of the comparison series. However, tight correspondence only really occurs for MED (Fig. 11p) and NAO (Fig. 11x). The reality, as we shall discuss in the next section, is that these datasets come with their own uncertainties and artificial inhomogeneities, as well.

3.4 Attributing 20CR’s breakpoints

The non-climate shifts identified in this study are more than likely only the lowest-hanging fruit. We believe that the number of natural (non-climate) breaks is much lower (higher) in reality due to remote (in time and/or space) network effects that are not well captured by the local variable spread upon which our diagnosis is entirely dependent. The teleconnections that translate these signals also carry important implications for homogenization. Namely, correcting for a single break can have reverberating (and perhaps unintended) consequences (i.e., either eliminating or giving rise to secondary breaks).

Potential sources of discontinuity include: climate events (e.g., El Niño, severe and extended drought, volcanic eruptions, loss of permanent land/sea ice) and climate change (natural or anthropogenic), model or observational bias, discontinuities in other model data (e.g., specification of greenhouse gas concentrations, volcanic aerosols, ozone, land surface conditions), reanalysis production in multiple streams, and technical mistakes in production. In the case of 20CR, stream production (described in Sect. 2.1) has been shown to affect the continuity of only slow-varying parameters (i.e., integrated column soil water content) at high latitudes (personal comm., Justin Sheffield 2012) that are not the focus of this work.

A more definitive attribution of 20CR’s inhomogeneities than we have provided here will require an immense effort moving forward. But it is a necessary hurdle in the development path towards a climate-quality reanalysis. Specifically, the homogeneity of the observational base for 20CR, which includes HadISST v1.1 and synoptic sea level pressure observations, will need to be reassessed. Long-term independent in situ datasets, of which CRU T a and GPCC P are primary examples, can assist in verifying realism. The difficulty is that each of the data products has its own problems and uncertainties, which are not very well understood. Ultimately, the objective, as depicted in Fig. 12, is to maintain climate variability (Fig. 12a, c, e, g) in the process of correcting for non-climate breaks (Fig. 12b, d, f, h).

Fig. 12

Date of the most modern physical [(a, c, e), and (g)] and non-climate [(b, d, f), and (h)] breaks detected in the February and July time series of (a–d) T a and (e–h) P. For each subplot, the total count of all breaks (not only the most recent) detected over the region 60°S–60°N (land and ocean) is annotated

Figure 13 puts the inhomogeneity of HadISST v1.1, HadSLP2, CRU T a, and GPCC P into perspective with that of 20CR (dataset descriptions in Sect. 2). It shows the global breakpoint count map for each of the datasets at their native spatial resolution, computed from their annual mean series. For the purpose of inter-comparison, the total break counts between 60°S and 60°N from the similar analysis conducted at 2° resolution are denoted on each subplot. An abundance of breakpoints is detected in CRU T a, HadSLP2, and HadISST v1.1, whereas GPCC P is found to be relatively homogenous. GPCC P has a homogenous fraction of 0.51 between 60°S and 60°N (land-only), compared to only 0.19 for 20CR P. Moreover, only 15 % of its coverage between 60°S and 60°N is affected by multiple breakpoints. The same statistic is 46 % for 20CR P.
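The homogenous and multiple-breakpoint fractions quoted above are simple summaries of a breakpoint count map. A minimal sketch follows, assuming the fractions are taken over grid cells with valid data and are not area-weighted (the text does not specify the weighting); the function name is hypothetical.

```python
def coverage_fractions(break_counts):
    """Summarize a breakpoint count map as fractions of valid grid cells.

    break_counts : iterable with one entry per grid cell; an int gives the
                   number of detected breaks, None marks a coverage gap.

    Assumption: every valid cell carries equal weight (no area weighting).
    """
    valid = [c for c in break_counts if c is not None]
    if not valid:
        raise ValueError("no valid grid cells")
    n = len(valid)
    return {
        "homogenous": sum(1 for c in valid if c == 0) / n,
        "single": sum(1 for c in valid if c == 1) / n,
        "multiple": sum(1 for c in valid if c >= 2) / n,
    }
```

The three fractions sum to one by construction, so any two of them determine the third.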

Fig. 13

Global breakpoint count maps for the yearly mean a CRU T a, b 20CR T a, c 20CR P surf, d HadSLP2, e GPCC P, f 20CR P, and g HadISST v1.1. The analyses were conducted at the native product resolution for either 20CR’s period of availability (1871–2010) or the dataset’s period of availability, whichever is shorter (i.e., CRU: 1901–2009; GPCC: 1901–2010; HadSLP2: 1871–2004; HadISST v1.1: 1871–2010). Grids in white are homogenous (at the 5 % significance level) for the period of record. Grids shaded in pink generally denote a coverage gap; however, they can also indicate permanent sea ice cover in (g), and in (a), the fact that climatology was imposed over a substantial portion of the record. The breakpoint counts provided in each subplot title are the results of a similar analysis (not shown) conducted over 60°S–60°N (land and ocean grids) at a standard spatial resolution of 2° and thus can be directly compared

The fact that so many inhomogeneities persist without definitive attribution, especially in HadISST v1.1 and HadSLP2, is reason for concern. These datasets constitute critical elements of climate modeling. All model-based reanalyses as well as climate model integrations rely on similar SST/sea ice estimates for boundary forcing. Surface (sea-level) pressure observations inform time-variations in the total mass of the atmospheric column, which the 20CR has demonstrated is sufficient information for producing a skillful reanalysis; they also comprise the earliest meteorological records. The knock-on effect of inhomogeneities in HadISST v1.1 and HadSLP2 on 20CR is apparent in Fig. 13. The weighted pattern (Pearson) correlation coefficient between breakpoint count maps of 20CR T a (P surf) and HadISST v1.1 (HadSLP2) is 0.51 (0.36). The bottom line is that attributing inhomogeneities in these input datasets is an essential prerequisite to attributing inhomogeneities in 20CR itself.
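For reference, a weighted pattern correlation between two count maps can be sketched as below. The weights are assumed here to be the cosine of grid-cell latitude, a common choice for regular latitude-longitude grids; the text does not state which weights were used, and the function name is hypothetical.

```python
import math


def weighted_pattern_correlation(x, y, lats_deg):
    """Weighted Pearson correlation between two maps on the same grid.

    x, y     : flat lists of values at matching grid cells (None = missing)
    lats_deg : latitude (degrees) of each cell center

    Assumption: weights are cos(latitude), approximating cell area.
    """
    cells = [
        (a, b, math.cos(math.radians(lat)))
        for a, b, lat in zip(x, y, lats_deg)
        if a is not None and b is not None
    ]
    wsum = sum(w for _, _, w in cells)
    if wsum <= 0:
        raise ValueError("no valid cells with positive weight")

    # Weighted means of each map
    mx = sum(w * a for a, _, w in cells) / wsum
    my = sum(w * b for _, b, w in cells) / wsum

    # Weighted covariance and variances (common 1/wsum factor cancels)
    cov = sum(w * (a - mx) * (b - my) for a, b, w in cells)
    vx = sum(w * (a - mx) ** 2 for a, _, w in cells)
    vy = sum(w * (b - my) ** 2 for _, b, w in cells)
    if vx == 0 or vy == 0:
        raise ValueError("one of the maps is constant")
    return cov / math.sqrt(vx * vy)
```

With uniform weights the expression reduces to the ordinary Pearson correlation; the latitude weighting simply prevents the many small high-latitude cells from dominating the result.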

4 Summary and conclusions

Using the Bai-Perron test for structural breakpoints, we have shown that the 20CR is affected by artificial “shocks” to an extent that varies on a regional, seasonal, and parameter basis. On a grid point basis, the occurrence of multiple abrupt shifts over the 140-year (1871–2010) record is common. Collectively, the scale of these abrupt shifts can appear overwhelming; 20,836 change points were detected in 20CR’s yearly 2 m air temperature record between 60°S and 60°N (Fig. 13).

The most important task is differentiating between breaks due to natural climate variability and breaks that are caused by observational network changes. We use the joint confidence intervals of breaks detected in the variable and spread fields to diagnose non-climate (unphysical) shifts, which we demonstrate account for approximately 72 % of all breaks (Sect. 3.3). In reality, the proportion could be even greater due to remote network effects (i.e., via teleconnections). The absolute jump size of the breaks can be sizeable. Seventy-two percent of breakpoints exceed one standard deviation of the preceding series segment, 50 % of breaks exceed 1.3 standard deviations of the preceding series segment, and 25 % of breaks exceed 1.8 standard deviations of the preceding series segment (Sect. 3.2).

On a positive note, a significant fraction of points do exist for which the full record is homogenous (Figs. 2, 3, 4, S2). And of the inhomogeneous records, often the last (most modern) homogenous segment extends back to a very early date. For example, July 2 m air temperature is statistically homogenous (free from non-climate breaks) back to 1942 (1922) for half of all land grids (excluding Greenland and Antarctica; not shown). In some cases, it is possible for the record to be considered natural (or even homogenous) over even longer intervals. This can occur when the jump size of the detected shift is deemed inconsequential to the application at hand (i.e., the change point is disregarded). Accordingly, geographically- and temporally-selective applications of the 20CR for long-term trend analysis are feasible.

The longest pre-existing global atmospheric reanalysis, NCEP-R1 (Kalnay et al. 1996; Kistler et al. 2001), does not even begin until 1948. Hence, and this should be appreciated, the 20CR constitutes a major improvement in series continuity (spatially and temporally) relative to the status quo. Relative to long-term in situ gridded datasets over land that span the twentieth century, 20CR’s degree of homogeneity is comparable to that of CRU 2 m air temperature, but less than that of GPCC precipitation (Fig. 13).

The manifestation of 20CR’s inhomogeneities in time and space, among variables and atmospheric levels, is shown to be highly complex, much like the reanalysis system from which they were generated (e.g., Fig. S3). This makes attributing inhomogeneities to their sources a challenge. Inhomogeneities in 20CR’s boundary forcing (HadISST v1.1) and input data stream (represented by HadSLP2), as well as the comparison datasets (CRU T a and GPCC P), make the process even more difficult. Specifically, we found that abrupt shifts in 2 m air temperature and surface pressure are related to coincident shifts in HadISST v1.1 and HadSLP2, respectively.

Currently, the presence of inhomogeneities confounds the detection and attribution of possible regional climate change signals in 20CR. Our hope is that the results of this work will serve as a valuable resource to 20CR’s broad user group as well as the developers of climate-quality reanalyses such as the planned Sparse Input Reanalysis for Climate Applications (SIRCA; Compo et al. 2012) to span 1850–2014. With sufficiently detailed metadata it is possible to track (to an extent) the propagation of artificial (observational) shocks in the record using the Bai-Perron test (as demonstrated herein) and subsequently correct the affected time series. However, attributing abrupt shifts away from their immediate source (i.e., teleconnections from sea to land) remains a challenge and will require a different approach than we have taken. Another difficulty is diagnosing shifts for which there are competing observational network and climate explanations. The many challenges to attribution suggest that an automated procedure is likely to be insufficient, while the sheer number of inhomogeneities all but necessitates one.