1 Introduction

Coupled model intercomparison projects have shown that major progress in modeling the tropical Pacific mean state and the El Niño-Southern Oscillation (ENSO) was made in the CMIP3 generation of coupled models (AchutaRao and Sperber 2006; Meehl et al. 2007; Guilyardi et al. 2009). CGCMs now simulate a stable mean state over several centuries without using any flux correction. Meanwhile, the ENSO pattern and amplitude are more accurately represented and its periodicity is closer to observations (Meehl et al. 2007; Guilyardi et al. 2009).

This increasing ability to simulate ENSO has made it conceivable to use CGCMs as predictive tools for the tropical Pacific climate at seasonal time scales, providing valuable societal information (McPhaden et al. 2006; Barnston et al. 2012). However, despite the aforementioned progress, systematic biases are still present in current models in both the mean state and the interannual variability (van Oldenborgh et al. 2005; AchutaRao and Sperber 2006; Guilyardi 2006, 2009; Wittenberg et al. 2006; Leloup et al. 2008). Correcting ENSO errors in coupled models has been shown to further improve forecast skill (Gualdi et al. 2005; Luo et al. 2005; Lee et al. 2010; Doblas-Reyes et al. 2009). It has also been argued that correcting model biases occurring at the seasonal time scale can help correct those occurring at longer time scales (Palmer et al. 2008). The tropical Pacific climate has remote influences, and reducing ENSO errors would further improve the simulation of its teleconnections with the other tropical basins and with the midlatitude climate (Power et al. 2006).

Comparisons of multi-model ensembles, such as CMIP3, have shown that models exhibit differences in ENSO frequency, amplitude and spatial pattern well beyond the observed diversity of events (van Oldenborgh et al. 2005; AchutaRao and Sperber 2006; Guilyardi 2006, 2009; Wittenberg et al. 2006; Leloup et al. 2008). First, model ENSO ranges from very regular near-biennial oscillations to spectra that are closer to the observed 2–7 year range. Second, ENSO amplitude varies from less than half to more than twice the observed amplitude. Finally, SST variability is overestimated in the west Pacific and underestimated in the east Pacific in most of the CMIP3 CGCMs; SST anomalies are generally displaced away from the upwelling region off South America. In addition, only a few models are able to reproduce the asymmetry of ENSO, with too many of them exhibiting a skewness near zero in the tropical Pacific. Guilyardi (2006) pointed out that the peak of El Niño and La Niña during boreal winter is often not captured by models, which show either little seasonal modulation of the anomalies or a phase-locking to the wrong season of the annual cycle (Kamala et al. 2011).

It is likely that these model shortcomings can be attributed both to errors in ENSO dynamics and to errors in the mean state. As an example of deficient ENSO dynamics, consider wind anomalies, which play a crucial role in setting the growth and the duration of ENSO events. Capotondi et al. (2006) have shown that wind anomalies in CMIP3 models are generally narrower than observed around the equator and displaced to the west. The reduced meridional extent of wind anomalies drives a lower Warm Water Volume recharge in the west Pacific, whereas the westward shift modifies the advective feedback. Both wind anomaly errors may be responsible for the ENSO timescale being shorter in the models than in the observations.

Several studies suggest that a correct mean state in coupled models is a prerequisite for a correct ENSO simulation (Fedorov and Philander 2001; Wittenberg et al. 2006; An et al. 2010). For example, Guilyardi (2006) suggests that an overly strong seasonal cycle amplitude can lead to overly weak El Niño events in several CMIP3 models. The influence of the mean state on interannual variability can occur via the feedbacks that depend on the mean state. For instance, the zonal advective feedback, which is the anomalous advection of the mean SST gradient, is sensitive to the mean isotherm position (Zhang and McPhaden 2010); the thermocline feedback can be too weak if the thermocline is too diffuse; and the SST damping by the atmospheric shortwave feedback can be wrong if the mean subsidence/convective regimes are not well simulated (Guilyardi et al. 2009).

A number of systematic mean state errors are still shared by many CGCMs. Most prominently, the seasonal equatorial cold tongue extends too far west, is too cold in the east Pacific and is associated with too strong trade winds (Davey et al. 2001; AchutaRao and Sperber 2006; Lin 2007). In contrast, a warm SST bias is found near the coast of South America (associated with a lack of low clouds and deficient winds in the region). The depth of the thermocline varies among the models, especially the mean depth and the slope along the equator. These biases might be a consequence of wind errors (van Oldenborgh et al. 2005; Brown et al. 2011). Mean state biases relevant to ENSO also include too strong easterlies in the west Pacific and the double ITCZ syndrome (Lin 2007; de Szoeke and Xie 2008). Identifying the sources of SST errors in the equatorial east Pacific is all the more challenging because the SST in this region is the result of a complex balance of processes. These include, for example, surface heat fluxes, upwelling and the effect of tropical instability waves (TIW), some of which result from both local and remote air-sea interactions (Wyrtki 1981; Wang and McPhaden 1999; Vialard et al. 2001). In coupled integrations, the Walker circulation rapidly drifts into equilibrium with the SST.

Because of these complex interactions, identifying the origin of ENSO and mean state deficiencies in CGCMs requires a specific strategy. Several approaches have been used. (1) Metrics that measure the distance between the model statistics and those of observations can help identify deficient feedbacks (e.g., BJ index, Jin et al. 2006; Kim and Jin 2011). However, further analysis is needed to understand the physical mechanisms involved in the development of these errors. (2) Simpler theoretical models based on CGCM outputs can also provide insights into the sources of errors (Chen et al. 2000; Boucharel et al. 2009; Roberts and Battisti 2011). (3) One component of the coupled system can be forced with observations to understand its own biases (atmosphere-only and ocean-only experiments) (Biasutti et al. 2006; Lin 2007; Richter and Xie 2008). This can help attribute the origin of biases to the AGCM or the OGCM, without coupled feedbacks. (4) To understand biases arising from the coupling, an ensemble of coupled models that share either an OGCM or an AGCM can be used: common biases help narrow down the component responsible for them (Guilyardi et al. 2004; Lazar et al. 2005).

A shortcoming of the above approaches is that they cannot help explain how coupled errors arise. Once the (usually discarded) spin-up phase of the long integrations required for robust statistics (Wittenberg 2009; Stevenson et al. 2010) is over, the model has drifted toward its new equilibrium and the path taken to reach it is lost. The use of hindcasts, initialized close to observations at regular intervals in the past, makes it possible to overcome this limitation and can help understand the development of errors in the models. The statistical robustness, obtained via the length of the simulations in the other approaches, is now obtained by performing ensembles of hindcasts. Hindcasts provide valuable information, as depicted in Fig. 1. If all the members depart together from the observations (B), it is the result of errors in the physical formulation of the model (systematic bias). By comparing the models against one another, we can deduce the nature of the systematic errors: either shared by all models or model specific. Seasonal forecasts also make it possible to determine the models' sensitivity to initialization (C): if members exhibit a large spread, then they are sensitive to small variations of the initial state. The benefits of seasonal hindcasts also come with limitations. The hindcast errors are not only due to model errors (our quest here), but also to errors in the initial conditions. As also discussed in Palmer and Weisheimer (2011), hindcast analysis helps to identify errors developing at short timescales, such as cloud and boundary layer turbulence errors, but it identifies neither the errors developing after a long leadtime nor the errors due to the approximation of chaotic phenomena by deterministic parameterization (the so-called second kind errors).

Fig. 1

Schematic of member behaviour in a seasonal forecast ensemble: (A) ideal forecast, (B) all members departing together from the observations, which is the signature of a systematic bias, (C) members spreading around the observations, which denotes sensitivity to initial conditions
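The two behaviours (B) and (C) of Fig. 1 can be separated numerically. The following is a minimal sketch, assuming the hindcasts for one variable are stored as an array of shape (members, leadtimes); the function name `bias_and_spread` and the synthetic values are illustrative and do not come from the ENSEMBLES archive.

```python
import numpy as np

def bias_and_spread(hindcasts, obs):
    """Split ensemble behaviour into a common drift and a member dispersion.

    hindcasts: array (n_members, n_leadtimes); obs: array (n_leadtimes,).
    """
    ens_mean = hindcasts.mean(axis=0)
    bias = ens_mean - obs                    # case (B): drift shared by all members
    spread = hindcasts.std(axis=0, ddof=1)   # case (C): dispersion among members
    return bias, spread

# Synthetic illustration (hypothetical numbers): three identical members
# that all cool by 0.5 degC per month relative to the observations.
obs = np.zeros(4)
drifting = np.tile(np.array([0.0, -0.5, -1.0, -1.5]), (3, 1))
bias, spread = bias_and_spread(drifting, obs)
```

In this synthetic case the bias grows with leadtime while the spread stays at zero, which corresponds to a pure systematic error.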

Seasonal hindcasts have been successfully used to understand how biases evolve with leadtime. Lazar et al. (2005) developed a diagnostic based on hindcasts and showed that for a given month the total error could be divided into two parts: the first-month error (due to the initial conditions and the first month of coupling), a shift that persists throughout the simulation; and the error of the remaining months, which evolves rapidly with the leadtime. The first-month error depends on both the ocean and the atmosphere, whereas the error of the following months only depends on the atmosphere (and is due to atmosphere physics errors, which dominate ocean physics errors). Assessing ENSO prediction in CMIP3/Clipas and ENSEMBLES, Jin et al. (2008) have shown that the accuracy of the simulated interannual variability is related to the accuracy of the simulated climatological mean state. Model forecast skill strongly depends on season, ENSO phase and ENSO intensity. Generally, intense El Niño events are better predicted, and the growth phases (corresponding to initialization in August and November) of both El Niño and La Niña events are better predicted than the corresponding decaying phases (initialized in February and May). Forecasting skill for neutral years is lower than that for El Niño and La Niña years. This result is consistent with Barnston et al. (2012), who showed a strong positive relationship between SST standard deviation in the central Pacific and forecast correlation skill over a given period. Finally, Hendon et al. (2009) have shown that whatever the location of the initialised SST anomaly, the POAMA model El Niño tends to drift toward the central Pacific with increasing leadtime, thus preventing the model from developing the observed diversity of events. However, these earlier studies did not use a hindcast approach to fully investigate the origin of model biases.

The present study investigates the origin of coupled model SST biases over the Pacific, with a focus on the equatorial cold tongue errors. In particular, we explore the hypothesis that the surface tropical wind errors, intrinsic to the atmospheric component of CGCMs, are responsible for the subsequent development of SST errors. This hypothesis is tested in hindcasts from the ENSEMBLES-FP6 project. Our objectives are (1) to identify the potential source of SST errors emerging in coupled mode, (2) to explore whether errors in the atmosphere component precede equatorial SST errors, and (3) to reproduce the SST errors in ocean-only experiments. In Sect. 2, we present the ENSEMBLES models and the data used as reference. In Sect. 3, we assess the SST mean state and interannual variability in the ENSEMBLES coupled models. In Sect. 4, we investigate the surface wind and the mixed layer depth as the main potential sources of SST errors in coupled models, a hypothesis that is further tested by forcing an OGCM with atmospheric errors to reproduce SST errors. The results are discussed in Sect. 5.

2 Data and methods

2.1 Models and hindcasts

The strategy we adopt is applied to the ENSEMBLES-FP6 project seasonal re-forecasts, stream 2 experiments (Weisheimer et al. 2009). ENSEMBLES provides seasonal hindcasts for the period 1960–2005 made by five European research centers: the European Centre for Medium-Range Weather Forecasts (ECMWF), the Institute for Marine Science at Kiel University (IFM-GEOMAR), Météo-France, the UK Met Office and the Euro-Mediterranean Centre for Climate Change in Bologna (CMCC-INGV). Table 1 gives the model description, model resolution, initialization methods and ensemble generation; further details about the models are found in Weisheimer et al. (2009).

Table 1 Ocean and Atmosphere components of ENSEMBLES seasonal forecasts CGCMs and initialization techniques used (from Weisheimer et al. 2009)

Four of the five models (ECMWF, Météo-France, Met Office, INGV) used the same initialization and ensemble generation approaches: an ensemble of three ocean analyses was generated by perturbing the wind stress and, at the start of each coupled hindcast, additional positive and negative SST perturbations were added to obtain nine members. Météo-France added SST perturbations directly to their ocean analyses. The IFM model followed another strategy: three coupled simulations were restored toward observed SST, and the initial conditions for all nine hindcast ensemble members were obtained from permutations of the ocean and atmosphere initial conditions of these three runs. Three 7-month and one 14-month hindcasts are made every year over the period 1960–2005, starting respectively in February, May, August and November. Nine members are generated at each start date using the techniques described above.

A seasonal cycle is reconstructed after leadtimes of 0,1,…,4 months using the respective forecast periods, 1st–3rd, 2nd–4th, …, 5th–7th months, to follow the evolution of the seasonal cycle and mean state biases.
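The leadtime windows described above can be sketched as follows. This is an illustrative sketch only: it assumes monthly-mean hindcasts stored along the last array axis and assumes that the first forecast month is the calendar month of the start date; the function names are hypothetical.

```python
import numpy as np

def lead_window_means(hindcast, window=3):
    """Average consecutive `window`-month periods of a hindcast.

    hindcast: array (..., n_months) of monthly means.
    Returns an array (..., n_months - window + 1); index 0 is leadtime 0
    (1st-3rd months) and index 4 is leadtime 4 (5th-7th months) for a
    7-month hindcast.
    """
    n_leads = hindcast.shape[-1] - window + 1
    means = [hindcast[..., lead:lead + window].mean(axis=-1)
             for lead in range(n_leads)]
    return np.stack(means, axis=-1)

def window_calendar_months(start_month, lead, window=3):
    """Calendar months (1-12) covered by the window at a given leadtime."""
    return [(start_month - 1 + lead + k) % 12 + 1 for k in range(window)]

# A 7-month hindcast with placeholder values 1..7
lead_means = lead_window_means(np.arange(1.0, 8.0))
# Months covered by the 5th-7th month window of a November start
nov_lead4 = window_calendar_months(start_month=11, lead=4)
```

Collecting the windows of the four start dates at a fixed leadtime then covers the calendar year, which is what allows a seasonal cycle to be rebuilt for each leadtime.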

2.2 Observations and reanalysis

The leadtime-dependent error is defined as the difference between the hindcasts and a reference. SSTs are compared to HadISST up to 1981 (Rayner et al. 2003) and to the Reynolds analysis afterwards (Reynolds et al. 2002), both of which are used in the ERA40 reanalysis. The 10 m wind velocities come from the ERA40 reanalysis (Uppala et al. 2005), which covers the period 1958–2005. The cloud cover is given by ISCCP, based on satellite observations from 1984 to 2002 (Rossow et al. 1996). The OAFlux dataset provides longwave and shortwave downward heat fluxes produced by ISCCP, and latent and sensible turbulent heat fluxes estimated by the OAFlux reanalysis (Yu and Weller 2007). The latter two are available for the period 1965–2004, whereas the former two are available only after 1984. The OLR dataset is from NOAA observations (1975–2006) (Liebmann and Smith 1996). Precipitation data are from the CMAP reanalysis (Xie and Arkin 1997), available for the period 1979–2006. The ECMWF Ocean Reanalysis system 3 OR3 (1959–2005) (Balmaseda et al. 2008) provides reference fields for the thermocline and mixed layer depths.

3 Mean state and interannual variability in the tropical Pacific

3.1 Mean state over the tropical Pacific

Figure 2 presents the mean SST error in the tropical Pacific after a forecast time of 5–7 months for the period 1960–2001. Four of the five models share the classical cold SST bias in the central Pacific along the equator (up to −1.5 °C). The ECMWF model is the only one that does not display any cold bias in the central part of the basin. All the models have, to some extent, a warm bias competing with the cold one in the far eastern part of the basin. This warm bias has been attributed to several factors in the literature. First, it can be due to an underestimation of the southerly coastal wind, resulting in a weaker than observed upwelling. The second hypothesis is a lack of low clouds in the south-east Pacific due to insufficient sensitivity of the stratus cloud amount to SST (Philander et al. 1996; Lin 2007). Finally, it can result from a misrepresentation of the horizontal eddy activity that transports cold waters away from the coast (Echevin et al. 2011).

Fig. 2

SST mean state errors over the tropical Pacific for the five ENSEMBLES models after a 5-month lead time: a ECMWF, b IFM, c Météo-France, d Met Office, e INGV. Reynolds SST has been taken as reference and is plotted in f

The SST mean errors in the cold tongue region are established in a few months, as shown by Fig. 3. In this figure, the root-mean-square error (RMSE) of the SST ensemble mean averaged in Niño 3.4 is plotted as a function of leadtime. We chose to compute this diagnostic in the central Pacific, to assess the central cold SST bias and not the eastern warm bias. This quantity has been averaged for the four start dates (February, May, August and November) over a period of 40 years (climatology). The ensemble mean RMSE combines the systematic bias (defining the “climatological errors”) and the standard deviation of the ensemble mean errors (giving the imprecision of the forecast): the squares of these two terms sum to the mean square error. For all models except ECMWF, the curves tend to level off after a few months. It takes only 2 months in IFM and Météo-France. The existence of a plateau in many of the ENSEMBLES models suggests that the mean state biases in the equatorial Pacific are mainly set prior to the 7-month leadtime. It also justifies our approach of using the seasonal cycle reconstructed with the 5–7 month leadtime as a proxy for the stabilized mean state, at least for the SST and in this region. Furthermore, the 5–7 month forecast period falls at different times of the year and, depending on the start date, the bias may evolve differently with increasing leadtime. The seasonality of the SST bias will be discussed later.
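The relation between the RMSE, the systematic bias and the standard deviation of the errors can be checked numerically. The sketch below assumes a time series of ensemble-mean forecasts and matching observations at one leadtime; the SST values are hypothetical and the function name is illustrative.

```python
import numpy as np

def rmse_decomposition(ens_mean_forecast, obs):
    """Return (rmse, bias, std) of the forecast errors.

    The population standard deviation (ddof=0) is used so that
    rmse**2 == bias**2 + std**2 holds exactly.
    """
    err = np.asarray(ens_mean_forecast, dtype=float) - np.asarray(obs, dtype=float)
    bias = err.mean()                    # systematic ("climatological") error
    std = err.std()                      # scatter of the errors around the bias
    rmse = np.sqrt(np.mean(err ** 2))
    return rmse, bias, std

# Hypothetical Nino 3.4 SSTs (degC) for four hindcast years at one leadtime
forecast = np.array([26.0, 25.5, 27.0, 26.5])
obs = np.array([27.0, 26.0, 27.5, 28.0])
rmse, bias, std = rmse_decomposition(forecast, obs)
```

When the RMSE curve levels off with leadtime while the scatter term keeps evolving slowly, the plateau mostly reflects the saturation of the bias term.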

Fig. 3

Niño 3.4 SST RMSE for each ENSEMBLES model, computed against Reynolds. The RMSE has been averaged over the ensemble, over the four start dates and over the 40-year time series. Dark blue ECMWF, green IFM-Kiel, red Météo-France, light blue Met Office, purple INGV

In Fig. 4, we present the equatorial Pacific mean state errors for the five models (from top to bottom: SST, zonal wind, thermocline depth and precipitation). Two different behaviors, which follow the SST biases, can be seen. We first detail the Met Office model biases, as an example of a model having a cold bias. The major surface wind biases are confined to the western part of the basin, where the zonal velocity is 3 m s−1 stronger than in ERA40 (Fig. 4b), so that the mean zonal wind is outside the range of the observed interannual variability (gray shading). The too strong trade winds have two effects contributing to the SST cooling: first, they tend to shoal the thermocline in the central to eastern Pacific (Fig. 4c), which at the equator is up to 40 m shallower than in OR3; second, they induce a too strong latent heat flux, which cools the surface waters in the western and central Pacific (not shown). There is a lack of precipitation over the entire Pacific basin in the Met Office model (Fig. 4d), and convection is shifted over the maritime continent, which is consistent with the results found by Meehl et al. (2001) in a multi-model study. IFM, Météo-France and INGV exhibit the same dynamical errors as the Met Office model. However, the zonal wind in Météo-France is weaker by 1 m s−1 compared to ERA40, suggesting that the cold bias is not always associated with an enhancement of the Walker circulation.

Fig. 4

Mean state errors after a 5–7-month leadtime averaged between 2°N and 2°S: a SST (°C, vs. Reynolds), b 10 m zonal wind (m s−1, vs. ERA40), c 20 °C isotherm depth (m, vs. OR3), d precipitation (mm day−1, vs. NOAA). Dark blue ECMWF, green IFM-Kiel, red Météo-France, light blue Met Office, purple INGV. The gray shading represents the anomaly standard deviation in observations and analyses
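The meridional averaging between 2°N and 2°S used for such equatorial sections can be sketched as below. This is only an illustration: the cosine-of-latitude weighting is an assumption (the text does not state how the average was weighted, and the weights are nearly uniform this close to the equator), and the function name and grid are hypothetical.

```python
import numpy as np

def equatorial_mean(field, lats, lat_min=-2.0, lat_max=2.0):
    """Average a (n_lat, n_lon) field over a latitude band.

    lats: latitudes in degrees north, shape (n_lat,).
    Returns the zonal section, shape (n_lon,).
    Assumption: grid rows are weighted by cos(latitude).
    """
    lats = np.asarray(lats, dtype=float)
    in_band = (lats >= lat_min) & (lats <= lat_max)
    weights = np.cos(np.deg2rad(lats[in_band]))
    return np.average(np.asarray(field, dtype=float)[in_band, :],
                      axis=0, weights=weights)

# Hypothetical 4 x 2 error field on latitudes 3S, 1S, 1N, 3N
lats = np.array([-3.0, -1.0, 1.0, 3.0])
error_field = np.array([[0.0, 0.0],
                        [1.0, 3.0],
                        [3.0, 5.0],
                        [9.0, 9.0]])
section = equatorial_mean(error_field, lats)
```

Only the rows at 1°S and 1°N fall inside the band here, and they carry equal weights by symmetry.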

In contrast, the warm bias displayed by ECMWF is associated with slightly too weak easterlies in the western part of the basin and slightly too strong easterlies in the eastern part of the basin. The thermocline slope is too flat (deeper in the east and shallower in the west). There is too much precipitation over the equator (Fig. 4d), the ITCZ being shifted southward compared to observations (not shown).

The zonal wind averaged over the central and east Pacific is presented in Fig. 5. The models exhibit a large variety of zonal wind seasonal cycles in the Pacific, the IFM model being the only one close to observations. ECMWF, Met Office and INGV have too strong trade winds. The first two, though in phase with the observations, overestimate the amplitude of the seasonal cycle. In contrast, INGV simulates a wind relaxation in the east Pacific at the time when the zonal wind maximum is observed. Météo-France’s zonal wind is too weak in the eastern Pacific all year round.

Fig. 5

Zonal wind (m/s) averaged between the dateline and 90°W, 5°N and 5°S. Black ERA40 10 m-zonal wind, dark blue ECMWF, green IFM-Kiel, red Météo-France, light blue Met Office, purple INGV

The SST, surface wind, thermocline and precipitation biases are linked to an overestimation (underestimation) of the Walker circulation in the Met Office (ECMWF) model. So far, it is not possible to state whether these errors evolved jointly in the ocean and in the atmosphere as the result of a joint drift due to the coupling, or whether they can be attributed to a preexisting error.

3.2 Interannual variability

The standard deviations of the equatorial SST and zonal wind anomalies in the ENSEMBLES models are evaluated in Fig. 6. In hindcasts, the standard deviation is based on a 5-month lead 3-month mean (including the 5th, 6th and 7th months). Anomaly standard deviations are computed for each of the 9 members over the 40-year hindcasts and then averaged across the ensemble. The SST standard deviation measures the amplitude of ENSO, and the zonal wind standard deviation evaluates the ability of the atmosphere to respond to the SST anomalies and subsequently to sustain them. In Fig. 6, we added the area (gray shading) in which the model standard deviation is not significantly different from the observed standard deviation.
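The order of operations described above (standard deviation per member first, ensemble average second) can be sketched as follows. The sketch assumes one grid point and one start date, with the 3-month means stored as an array of shape (members, years); the name `ensemble_anomaly_std` and the sample values are hypothetical.

```python
import numpy as np

def ensemble_anomaly_std(hindcasts):
    """Mean over members of each member's interannual standard deviation.

    hindcasts: array (n_members, n_years) of 3-month means.
    Anomalies are taken relative to each member's own time mean.
    """
    anomalies = hindcasts - hindcasts.mean(axis=1, keepdims=True)
    member_std = anomalies.std(axis=1, ddof=1)   # one value per member
    return member_std.mean()                     # average across the ensemble

# Hypothetical two-member, four-year example (degC)
members = np.array([[1.0, -1.0, 1.0, -1.0],
                    [2.0, -2.0, 2.0, -2.0]])
amplitude = ensemble_anomaly_std(members)
```

Averaging the member standard deviations, rather than taking the standard deviation of the ensemble mean, avoids the damping that ensemble averaging applies to the unpredictable part of each member's variability.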

Fig. 6

a SST and b zonal 10 m-wind anomaly standard deviations after a 5–7-month leadtime averaged between 2°N and 2°S along the equator for all start dates. Black Reynolds SST and ERA40 10 m-zonal wind, dark blue ECMWF, green IFM-Kiel, red Météo-France, light blue Met Office, purple INGV. Outside the gray shading, the model anomaly standard deviation is significantly different from the observed standard deviation at the 0.99 level using an F-test

Reynolds SST standard deviation reaches its maximal value in the east Pacific and forms a relatively constant 1.3 °C plateau from 140°W to the Peruvian coast. In ERA40, a maximum of zonal wind standard deviation is observed at 165°E and reaches 2 m s−1 (Fig. 6b).

The SST standard deviation is maximum far from the coast in all models and peaks near 120°W in ECMWF, IFM, Met Office and INGV. In the east Pacific, IFM, Met Office and INGV overpredict the interannual variability by 0.3 °C, whereas no significant error is observed in ECMWF. In Météo-France, INGV and marginally in IFM, the SST variability extends too far to the west. These models are those that develop a cold bias in the central Pacific.

The maximum of the zonal wind standard deviation (Fig. 6b) is shifted westward by 10° to 20° in the five models. The zonal wind standard deviation in the Met Office model, 1.2 m s−1, is approximately half the observed one and is significantly different from it (Fig. 6b). The zonal wind standard deviation in all the other models ranges from 1.7 m s−1 to 2.3 m s−1 and is too weak in the central Pacific (from the dateline to 140°).

3.3 Atmospheric responses to SST anomalies

We now try to further interpret the SST standard deviation errors in the ENSEMBLES models. Several studies have described the diversity of SST anomalies based on feedback computation. In particular, the BJ index successfully described the process-based amplitude of SST anomalies in CMIP3 models (Kim and Jin 2011). Two atmospheric feedbacks involved in ENSO dynamics have been shown to exhibit significant errors in CMIP3 CGCMs (Lloyd et al. 2009, 2011): first, the atmospheric part of the Bjerknes feedback and second, the shortwave feedback. The atmospheric part of the Bjerknes feedback is computed as the regression of zonal wind anomalies in the west Pacific (Niño 4) on SST anomalies in the east Pacific (Niño 3). The complete Bjerknes feedback also includes the response of the ocean, in turn, to the wind change in the west Pacific; however, for brevity, the atmospheric part is loosely called the Bjerknes feedback in the following. The shortwave feedback is computed as the regression of shortwave heat flux anomalies on SST anomalies, both averaged in Niño 3. According to Lloyd et al. (2011), it represents the main contribution to the total heat flux feedback errors. Since the adjustment is faster in the atmosphere than in the ocean, feedback errors are likely to appear first in the atmosphere. In the following we present the Bjerknes and shortwave feedbacks (Figs. 7, 8) for the 5–7 month forecast time (i.e., for each month of the year, the feedback value was calculated using the longest leadtime available, over the 40-year time period).
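Both feedbacks are regression slopes, which can be sketched as below. This is a minimal illustration using an ordinary least-squares slope; it assumes region-averaged anomaly time series are already available, and the function name and the synthetic anomalies are hypothetical.

```python
import numpy as np

def feedback_slope(response_anom, sst_anom):
    """Least-squares slope of a response regressed on SST anomalies.

    With Nino 4 zonal wind as the response and Nino 3 SST as the
    predictor, the slope has units of m s-1 per degC; with Nino 3
    shortwave flux as the response, W m-2 per degC.
    """
    x = np.asarray(sst_anom, dtype=float)
    y = np.asarray(response_anom, dtype=float)
    x = x - x.mean()
    y = y - y.mean()
    return np.sum(x * y) / np.sum(x * x)

# Synthetic anomalies (hypothetical): wind responds with +0.5 m s-1 per degC,
# shortwave flux with -10 W m-2 per degC
sst = np.array([-1.0, 0.0, 1.0, 2.0])
wind = 0.5 * sst
shortwave = -10.0 * sst + 3.0
bjerknes = feedback_slope(wind, sst)
sw_feedback = feedback_slope(shortwave, sst)
```

Applying the same function month by month, with the time series restricted to a given calendar month, would give an annual evolution of each feedback.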

Fig. 7

Annual evolution of Bjerknes feedback for 5–7 month forecast time. Black reference computed with ERA40/Reynolds, dark blue ECMWF, green IFM-Kiel, red Météo-France, light blue Met Office, purple INGV

The Bjerknes feedback is stronger during the second part of the year, when ENSO anomalies are amplified (Fig. 7). The main Bjerknes feedback bias in models seems to reside in its wrong seasonal cycle, the weakest feedback values being in spring instead of boreal winter as observed. The Met Office model has the largest dip in spring. The IFM model has a seasonal cycle closer to observations, but the amplitude of the seasonal cycle is underestimated. Overall, the Météo-France model seems the best considering both amplitude and phase. The underestimation of the Bjerknes feedback value may have several origins: the relationship between wind in the west and SST in the east may be weakened; the interannual variability of the wind may be damped by another mean state error; and finally the wind response to SST may be shifted toward the west, like the wind interannual variability, as suggested by Fig. 6b.

In observations, the shortwave feedback, computed with Reynolds and OAFlux, is mainly negative (Fig. 8): a positive SST anomaly tends to decrease the heat flux anomaly, which, in turn, cools the SST. The shortwave feedback is stronger during the first part of the year (up to −30 W m−2 K−1 in March) and weaker (−5 W m−2 K−1) in the second part. It has been shown based on CGCMs that the sign of the feedback depends on the dominant dynamical regime in the east Pacific (Guilyardi et al. 2009; Lloyd et al. 2011). During an ascent regime in the eastern Pacific, the convective activity increases with SST and high clouds generate a negative feedback on SST. Conversely, during a subsidence regime, an increase of SST develops shallow convection, reduces boundary layer stability and breaks up the stratocumulus clouds. In this case, the shortwave heat flux exerts a positive feedback on SST in Niño 3.

Fig. 8

Annual evolution of shortwave response to SST in Niño 3, for 5–7 months forecast time. Black reference computed with OAFlux/Reynolds, dark blue ECMWF, green IFM-Kiel, red Météo-France, light blue Met Office, purple INGV

ECMWF and Météo-France simulate the right amplitude of the shortwave feedback (Fig. 8), whereas it has too weak values throughout the year in IFM, Met Office and INGV. In the latter three models, the strong cold bias is associated with a too strong subsidence regime in the east (not shown). This may explain the positive value of the shortwave feedback most of the year in IFM, Met Office and INGV, especially in the second part of the year, instead of the negative observed value.

The values of these two atmospheric responses can help explain the intensity of the SST standard deviations displayed in Fig. 6a. In ECMWF, the SST anomalies are damped when the negative shortwave feedback is overestimated, especially from May to December. In IFM, INGV and Met Office, the weak values of the shortwave feedback make up for the underestimated Bjerknes feedback, which causes the standard deviation to be too strong in the east, as described above.

Lloyd et al. (2011) have shown that the Bjerknes feedback has a correct value in AMIP simulations but tends to be weak in CMIP3 models. With hindcasts, we can investigate how fast this weakening happens. In Table 2, we report the time after which the feedback values have reached 75 % of their final values. These thresholds are reached for most of the feedbacks in less than 3 months. Surprisingly, some models (ECMWF, IFM and INGV) have reached a value of the shortwave feedback close to the final one after only 1 month, even if the mean SST biases are not yet stabilized (as shown in Fig. 3). This suggests that some of the atmospheric mechanisms modulating ENSO are set from the first month of the simulation, even if the ENSO characteristics continue to respond to the identified feedbacks at longer leadtimes.
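The 75 % criterion can be sketched as follows. This is one possible reading, stated as an assumption: the reported leadtime is taken as the first month from which the feedback stays at or beyond 75 % of its value at the last leadtime. The function name and the feedback values are hypothetical.

```python
import numpy as np

def months_to_fraction(values_by_lead, fraction=0.75):
    """First lead month (1-based) from which the feedback stays at or
    beyond `fraction` of its final value.

    values_by_lead[i] is the feedback at lead month i + 1; the last
    entry is taken as the final value. Taking the ratio to the final
    value makes the test work for negative feedbacks as well.
    """
    values = np.asarray(values_by_lead, dtype=float)
    final = values[-1]
    if final == 0.0:
        raise ValueError("final feedback value must be non-zero")
    reached = values / final >= fraction
    failing = np.flatnonzero(~reached)
    return 1 if failing.size == 0 else int(failing[-1]) + 2

# Hypothetical feedback values at lead months 1..7
bjerknes_by_lead = [0.2, 0.5, 1.0, 1.0, 1.1, 1.2, 1.2]        # m s-1 per degC
shortwave_by_lead = [-13.0, -14.0, -15.0, -16.0, -16.0, -16.0, -16.0]
t_bjerknes = months_to_fraction(bjerknes_by_lead)
t_shortwave = months_to_fraction(shortwave_by_lead)
```

In this synthetic example the shortwave feedback meets the criterion from the first month, whereas the Bjerknes feedback needs three months.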

Table 2 Leadtime after which Bjerknes and Shortwave feedbacks have reached 75 % of their final value (at leadtime of 7 months)

4 Mechanisms of SST errors growth

Errors in the “mean” state can come from the rectification of the interannual variability, from mean errors, or from both. On the one hand, if one of the phases of ENSO (El Niño or La Niña) is not simulated with fidelity, by construction the mean state will be biased as well. On the other hand, if a physical process is not accurately represented, integration of the bias may also result in mean state errors. In the following part we first assess the representation of El Niño and La Niña events and notably how the cold equatorial biases develop during each phase. Then, we show that mean state surface wind and mixed layer depth biases are likely to explain the development of the cold bias.

4.1 ENSO composites

The Niño 3 SST El Niño composites are shown in Fig. 9 over 2 years (the onset year and the decaying year). The dark blue, green, red and light blue curves represent the 7-month ensemble mean hindcast simulations initialized in February, May, August and November, respectively. The black curve is the ERA-40 reanalysis. The gray shading and the colored thin lines denote the maximum and minimum of the composite in the observations and in the hindcasts, respectively. The mean seasonal cycle of each model, constructed after a 5 to 7 month forecast time, is added (brown). When the black curve is above (below) the black dashed curve, El Niño (La Niña) is observed. Similarly, when the colored curves are above (below) the brown curve, an El Niño (La Niña) event is simulated by the model.

Fig. 9

Monthly SST in Niño 3 [150°W 90°W 5°S 5°N], for a composite of El Niño years [1967, 1972, 1982, 1987, 1997] in the five models and in observations. The black curve stands for the Reynolds reference. Blue curves are the simulations initialized in February, green is for May, red for August and pale blue for November. The gray shading denotes the temperature range (computed as min/max) of the observed composite and the colored thin lines give the same information for the hindcasts. The brown curve is the reconstructed seasonal cycle after a leadtime of 5 months. The black dashed curve is the observed seasonal cycle. From left to right and from top to bottom: ECMWF, IFM-Kiel, Météo-France, Met Office, INGV
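The construction of such a two-year composite, with its mean and min/max envelope, can be sketched as follows. The sketch assumes a monthly Niño 3 SST array of shape (years, 12); the El Niño years are those listed in the caption, while the function name and the placeholder data (each month simply set to its year) are hypothetical.

```python
import numpy as np

EL_NINO_YEARS = [1967, 1972, 1982, 1987, 1997]   # years listed in the caption

def composite(monthly_sst, years, event_years, n_years=2):
    """Composite of `n_years`-long segments starting in each event year.

    monthly_sst: array (n_total_years, 12); years: matching year labels.
    Returns (mean, minimum, maximum), each of shape (12 * n_years,).
    """
    years = list(years)
    segments = []
    for year in event_years:
        i = years.index(year)
        if i + n_years > len(years):
            continue                     # segment would run past the record
        segments.append(monthly_sst[i:i + n_years].ravel())
    stack = np.stack(segments)
    return stack.mean(axis=0), stack.min(axis=0), stack.max(axis=0)

# Placeholder data: every month of a given year holds that year's number,
# which makes the composite easy to verify by hand.
years = np.arange(1960, 2002)
placeholder = np.tile(years[:, None], (1, 12)).astype(float)
mean, lowest, highest = composite(placeholder, years, EL_NINO_YEARS)
```

The same function applies to each hindcast start date and to the La Niña years, so observed and simulated composites are built identically.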

The predicted SST in ECMWF model never deviates more than 0.5 °C from the observations. However, it underestimates the maximum SST in November when initialized in May by 1 °C (upper green thin line vs. gray shading). In the four other models, results strongly depend on both the initialization month and the leadtime.

The SST evolution in the Met Office and INGV models is very similar: the models are initialized close to observations for all start dates but exhibit a strong drift in the months following the February and May start dates. Met Office and INGV miss the Niño 3 SST by 2.5 and 1.5 °C, respectively, in November when initialized in May. In these two models, during the El Niño onset and for a given date, the bias increases with the leadtime: considering for instance the August case in the Met Office model, the bias is about 0.5 °C after a month of simulation (red curve), 2 °C after 4 months (green curve) and 3 °C after 7 months (blue curve). For both models, the SST decrease is the same in the February and May simulations and in the model seasonal cycle, suggesting similar mechanisms for the error development, which depend on the model seasonal cycle. Finally, the El Niño SST is correctly predicted only when the models are initialized in November, i.e. when the event is already underway.

Whatever the model considered, when initialized in August, after 7 months leadtime, the hindcast reaches the right amplitude of El Niño, in spite of initialization errors (IFM, Météo-France) or an initial SST decrease (Met Office and INGV), suggesting an error compensation.
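The composite curves and min/max envelopes discussed above can be reproduced with a few lines of array code. The following is a minimal sketch, assuming the Niño 3 SST is available as a (year, month) array; the function name `composite` and the synthetic input are hypothetical and are not the processing chain used for Fig. 9.

```python
import numpy as np

# Composite years listed in the Fig. 9 caption
EL_NINO_YEARS = [1967, 1972, 1982, 1987, 1997]


def composite(sst, years, event_years):
    """Return (mean, min, max) over the event years for each calendar month.

    sst         : array of shape (n_years, 12), monthly Niño 3 SST in °C
    years       : sequence of length n_years giving the year of each row
    event_years : years to include in the composite
    """
    sst = np.asarray(sst, dtype=float)
    mask = np.isin(years, event_years)
    if not mask.any():
        raise ValueError("none of the event years are present in `years`")
    selected = sst[mask]
    # The mean gives the thick composite curve; min and max give the envelope
    return selected.mean(axis=0), selected.min(axis=0), selected.max(axis=0)


# Synthetic stand-in data (NOT observations): a flat 26 °C record for 1960-2001
years = np.arange(1960, 2002)
sst = np.full((years.size, 12), 26.0)
sst[years == 1997] += 2.0  # make one event year warmer so the envelope is visible

mean, low, high = composite(sst, years, EL_NINO_YEARS)
```

The same function applied to each hindcast start date would give the colored thick and thin curves; passing the La Niña years instead yields the cold-phase composite.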

The Niño 3 SST La Niña composite is shown in Fig. 10. In the ECMWF hindcasts, a warm bias of up to 1.5 °C develops steadily with increasing leadtime. In the Met Office model, although the SST is up to 1.5 °C too cold from August to November, the SST bias is weaker than in the El Niño composite throughout the year. The INGV model gives the best results for the La Niña phase, with hindcasts always matching the Reynolds SST until December. However, when initialized in November of La Niña years, the INGV model cannot return to the seasonal cycle (compare the light blue curve with the dashed and solid black curves) and stays in a cold anomaly.

Fig. 10

Same as Fig. 9 but for a composite of La Niña years, namely [1964, 1970, 1971, 1973, 1975, 1984, 1988, 1998, 1999]

The hindcast initialized in November (light blue curve) makes it possible to assess the decay of an El Niño or La Niña event. In observations, an El Niño year tends to be followed by a La Niña event (compare solid and dashed black lines in Fig. 9), as theorized by the recharge/discharge oscillator mechanism (Jin 1997). ECMWF (light blue line vs. brown line in Fig. 9) correctly simulates a cold anomaly after a warm event, suggesting that this model captures the discharge mechanism. In contrast, IFM and Met Office predict a neutral year and Météo-France a second El Niño year. After La Niña (Fig. 10), ECMWF and Met Office predict a neutral year, as in the observations, whereas IFM and Météo-France predict an El Niño event.

We point out that the cold SST bias along the equator and the warm bias in the east Pacific may cancel each other when averaging in Niño 3. This might be the case in IFM-Kiel and INGV models where the warm bias extends farther west than in the other models (Fig. 2).

Strong errors from the first month, attributed to incorrect initialization, are mainly observed in the IFM and Météo-France models for the August and November start dates. IFM is 1.5 °C too cold from the first month of the simulations initialized in August, for both El Niño and La Niña composites. This may be due to the initialization technique used: the IFM SST nudging may not provide sufficiently accurate subsurface initial conditions, making the model too sensitive to upwelling. In Météo-France, there is a 1 °C cold bias in the composite for almost all the hindcasts, reached after less than 2 months of simulation. However, this model is able to reproduce the SST amplitude of the strongest events when initialized in May and August (the upper green thin curve matches the upper bound of the gray shading), indicating that there is no systematic bias in this model. The minimum in the composite is close to the model seasonal cycle from the first month: the misrepresentation of the weakest El Niño events may be responsible for the underestimation of the SST composite mean.

In summary, the models do not simulate El Niño and La Niña onsets (February and May start dates) with the same accuracy. ECMWF and Météo-France are more skillful at simulating El Niño onset, whereas IFM, Met Office and INGV display better skill for La Niña growth. The same conclusion can be drawn for the decay of events when initialized in November: the INGV prediction matches the observed El Niño decay but largely fails to come back towards its mean seasonal cycle during the La Niña decay. Due to this asymmetric skill, the stronger cold equatorial errors associated with El Niño carry more weight than those associated with La Niña in the Met Office and INGV mean state errors from June to December. Conversely, the erroneous La Niña decay carries more weight than El Niño errors in the INGV mean state errors from February to June. The mechanisms likely to cause the SST errors therefore differ depending on whether the errors occur during El Niño or La Niña.

4.2 Investigating the origins of the mean equatorial cold bias

As the cold equatorial bias is most prominent during El Niño onset, we now use El Niño composites to further investigate its origin. Our analysis identified mean state surface wind and/or mixed layer depth errors as the likely source of the cold bias emerging from the February and May start dates in Met Office and INGV during El Niño years.

4.2.1 Surface winds

In observations, the strength of the zonal wind at the equator for a composite of El Niño events in May is reduced compared to the seasonal cycle (Fig. 11, compare solid and dashed black curves). This decrease occurs exclusively west of 120°W. The equatorial zonal wind in May is shown after a leadtime of 1 week (light blue), 1 (red), 3 (green) and 7 months (dark blue). We added in brown the seasonal cycle after a leadtime of 7 months. The structure of the wind east of 150°W shows little difference between the composite of El Niño events (at all leadtimes) and the model’s mean seasonal cycle. East of 120°W, the wind is generally too strong by 1 to 2 m s−1 from the first week in all models, and even more so for INGV, for which the wind is 2 m s−1 too strong compared to ERA40 from the first week. In the Météo-France and Met Office models, winds are slightly too weak in the central part of the basin. Wind errors in the east and central Pacific persist after 7 months (the difference between the dark blue and black curves, i.e. the interannual variability, is similar to that between the brown and black dashed curves, i.e. the mean state errors). In the west Pacific however, and for the IFM, Met Office and INGV models, the easterlies intensify gradually with increasing leadtime. This behaviour can be interpreted as the Bjerknes response of the surface wind to the increasing cold SST bias in the east Pacific occurring in these models. Two types of zonal wind errors therefore arise: the east Pacific errors, which do not depend on leadtime and are also present in the mean state, and the western Pacific errors, which are leadtime dependent and are likely to evolve jointly with the SST bias.

Fig. 11

El Niño composite of monthly mean zonal wind at 10 m (m s−1) along the equator, averaged between 2.5°S and 2.5°N, for the month of May after a leadtime of (light blue) 1 week, (red) 1, (green) 4 and (dark blue) 7 months, and for ERA40 in black. The brown curve is the 40-year averaged mean zonal wind after a leadtime of 7 months (which is taken for the mean state). The black dashed curve is the ERA40 May climatology

4.2.2 Mixed layer

In Fig. 12 we present the evolution of the equatorial mixed layer depth in the east Pacific for a composite of El Niño years. In the OR3 reanalysis the mixed layer in Niño 3 deepens seasonally from 30 to 50 m from March onwards, in association with the flattening of the thermocline due to the weakening of the easterlies (black dashed curve). During El Niño years, the mixed layer deepens even more (black solid curve), in association with the eastern Pacific heat content increase. ECMWF is close to the reanalysis for all the start dates. One must however bear in mind that the OR3 reanalysis is a forced ocean reanalysis made using the same ocean component as in the ECMWF coupled model. For the remaining models, a severe 20–40 m bias is present from the first month and the hindcasts stay close to their mean seasonal cycle, missing the El Niño event.

Fig. 12

Same as Fig. 9 but for mixed layer depth (m) in box [150°W 90°W 2°S 2°N]. OR3 is taken as reference. Météo-France is missing

Errors in the mixed layer depth may arise from initialization or from model physics, such as the parameterization of the eddy viscosity and the vertical eddy diffusivity, as well as from the coupled system mean state through trade winds (entrainment) and heat fluxes (stratification). Since the mixed layer is too shallow for all start dates after only 1 month of simulation, even for November, and considering that no wind bias develops systematically for all the start dates in the equatorial Pacific, one can conclude that the mixed layer depth bias exists independently of the zonal wind velocity bias. The response time of SST errors to other biases will depend on the mixed layer depth, as SST errors develop faster when the mixed layer is shallow.

So far we have shown that wind errors are likely to explain the initial cold drift, since zonal wind biases in the east Pacific are present from the first week while no SST bias has yet appeared. Mixed layer depth biases can amplify this drift, since wind driven SST changes occur faster when the mixed layer is shallow.
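The amplifying role of a shallow mixed layer can be illustrated with a slab budget, in which the SST tendency is inversely proportional to the layer depth. The sketch below is an illustration under stated assumptions, not the diagnostic used in this study: the seawater constants are typical textbook values, the forcing is a hypothetical constant heat flux anomaly, and the two depths (50 m and 20 m) are chosen only to contrast a deep and a shallow layer.

```python
RHO_SEAWATER = 1025.0    # kg m-3, typical value (assumption)
CP_SEAWATER = 3990.0     # J kg-1 K-1, typical value (assumption)
SECONDS_PER_DAY = 86400.0


def slab_sst_change(flux_w_m2, mld_m, days=30):
    """SST change (K) of a slab of depth mld_m under a constant flux anomaly."""
    if mld_m <= 0:
        raise ValueError("mixed layer depth must be positive")
    heat_capacity = RHO_SEAWATER * CP_SEAWATER * mld_m  # J m-2 K-1
    return flux_w_m2 * days * SECONDS_PER_DAY / heat_capacity


flux = -20.0                           # W m-2, hypothetical cooling anomaly
deep = slab_sst_change(flux, 50.0)     # deeper mixed layer
shallow = slab_sst_change(flux, 20.0)  # illustrative shallow-biased mixed layer
ratio = shallow / deep                 # depends only on the depth ratio, 50 / 20
```

For a given forcing, the cooling scales with the inverse of the depth, so a layer 2.5 times shallower cools 2.5 times faster, whatever the amplitude of the forcing.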

4.2.3 Significance of zonal wind and SST errors

Because the wind and the SST are strongly coupled, we examine their daily evolution from the initial state of the May start, to investigate to what extent the wind bias precedes the SST bias.

In Fig. 13, we present time-longitude diagrams of the differences between each individual model ensemble mean and the multi-model ensemble mean, for SST and zonal wind, over the first 30 days of the hindcasts. Taking the multi-model ensemble mean as reference both extends the sample to 45 hindcasts (nine members from five models), making it possible to compute significance statistics, and allows us to understand the differences between models. Significance thresholds of differences between these structures were computed using the method described in Terray et al. (2003).
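The differences described above amount to two successive averages followed by a subtraction. The sketch below assumes the daily fields are stacked in an array of shape (model, member, day, longitude); the function name and the random input are hypothetical, and the significance test of Terray et al. (2003) is not reproduced here.

```python
import numpy as np


def model_minus_mme(fields):
    """Difference between each model ensemble mean and the multi-model mean.

    fields : array of shape (n_models, n_members, n_days, n_lon)
    returns: array of shape (n_models, n_days, n_lon)
    """
    fields = np.asarray(fields, dtype=float)
    if fields.ndim != 4:
        raise ValueError("expected (model, member, day, lon) input")
    model_mean = fields.mean(axis=1)  # average over members of each model
    mme = model_mean.mean(axis=0)     # equal-weight multi-model ensemble mean
    return model_mean - mme


# Synthetic stand-in: 5 models, 9 members, 30 days, 72 longitudes
rng = np.random.default_rng(0)
fields = rng.normal(size=(5, 9, 30, 72))
fields[4] -= 1.0                      # make the last model artificially colder
diff = model_minus_mme(fields)
```

Because every model contributes the same number of members, the mean of the model means equals the mean over all 45 hindcasts, and the differences sum to zero across models at every point.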

Fig. 13

Difference between each model's daily SST (left) and daily zonal 10 m wind (right) and the multi-model ensemble mean. The first 30 days of the simulations initialized in May are considered, for an El Niño onset composite. From top to bottom: ECMWF, IFM, Météo-France, Met Office and INGV. Non-hatched areas denote differences significant at the 0.99 confidence level

As already seen in Fig. 11, in the central and eastern Pacific, the zonal wind biases reach their maximum within the first week after the hindcast launch. Moreover, even though the longitude of the maximum wind bias shifts westward as the leadtime increases, the amplitude of the errors remains of the same order. In contrast, significant SST errors appear only after a few weeks for most models and take several months to fully develop, as described above.

Over a large part of the equatorial domain, wind differences are significant from the beginning of the hindcasts and in particular the strong easterly anomalies in the INGV model. The same test on the SST field shows that the anomalous structures are significant only after at least 15 days in the INGV model.

4.3 Simulating the AOGCMs cold bias in ocean-only simulations

To further show that the zonal wind bias is responsible for the development of SST errors, ocean-only experiments are forced with the ENSEMBLES AOGCM winds. A control run (REF) using the NEMO/ORCA2 OGCM (Madec 2008), forced through bulk formulae by the Drakkar Forcing Set 4 (DFS4, Brodeau et al. 2010), provides the common initial state of the forced simulations. DFS4 is an ERA40 and NCEP based product, corrected notably by using satellite observations (QuikSCAT for the wind) and by removing spurious temperature and humidity time discontinuities. The 6-hourly DFS4 winds have been daily averaged to filter out high frequency wind structures, so that the forcing remains comparable to the ENSEMBLES daily data.

The simulations are initialized on May 1st for six El Niño years (1972, 1982, 1987, 1992, 1994, 1997), maintaining the DFS4 bulk forcing but replacing the 10 m winds by those simulated in the ENSEMBLES hindcasts initialized at the same date. These ocean-only simulations are referred to as Dyn-HF (both dynamical and heat flux effects). However, the ERA40 2 m temperature used in the turbulent flux bulk formulae tends to relax the SST toward observations, damping the development of SST errors. To override this restoring, we also carry out fixed-flux simulations, referred to as Dyn-only (dynamical effect only), in which the heat fluxes prognostically computed by the bulk formulae are replaced by the fluxes diagnosed from the control run (REF). In other words, in the Dyn-only experiments, the wind variations modify the SST only dynamically and not via the latent heat flux.
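The distinction between the two protocols can be made concrete with a textbook bulk formula for the latent heat flux, which is proportional to the wind speed. The sketch below is a schematic illustration only: it is not the DFS4/NEMO bulk algorithm, the constants are typical magnitudes chosen as assumptions, it covers only the latent component of the heat fluxes, and the function names and input values are hypothetical.

```python
RHO_AIR = 1.2   # kg m-3, typical near-surface air density (assumption)
L_V = 2.5e6     # J kg-1, latent heat of vaporization (assumption)
C_E = 1.2e-3    # dimensionless exchange coefficient, typical magnitude (assumption)


def latent_heat_flux(wind_speed, q_sat_sst, q_air):
    """Upward latent heat flux (W m-2) from a simple bulk formula."""
    return RHO_AIR * L_V * C_E * wind_speed * (q_sat_sst - q_air)


def surface_latent_flux(experiment, model_wind, ref_flux, q_sat_sst, q_air):
    """Flux seen by the ocean under each protocol."""
    if experiment == "Dyn-HF":
        # Flux recomputed with the coupled-model wind: thermodynamic effect included
        return latent_heat_flux(model_wind, q_sat_sst, q_air)
    if experiment == "Dyn-only":
        # Flux prescribed from the control run: the wind acts only dynamically
        return ref_flux
    raise ValueError(f"unknown experiment: {experiment}")


# Hypothetical values: 6 m s-1 in REF, 8 m s-1 in a coupled model
q_sat, q_air = 0.020, 0.016  # kg kg-1
ref_flux = latent_heat_flux(6.0, q_sat, q_air)
dyn_hf = surface_latent_flux("Dyn-HF", 8.0, ref_flux, q_sat, q_air)
dyn_only = surface_latent_flux("Dyn-only", 8.0, ref_flux, q_sat, q_air)
```

In this toy setting, a stronger wind increases the evaporative cooling in Dyn-HF, whereas the Dyn-only flux is unchanged from REF, so any SST difference between the two isolates the thermodynamic effect of the wind.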

To maintain statistical robustness, and as, for each model, the corresponding ENSEMBLES and ocean-only simulations do not share the same initial state, we continue using the multi-model ensemble mean as reference. The equatorial SST differences at day 30 of ENSEMBLES, Dyn-HF, Dyn-only experiments with their respective multi-model ensemble means are shown in Fig. 14.

Fig. 14

SST difference after a 30-day integration between each model and the model ensemble mean for (blue) forced experiments (Dyn-HF), (green) forced experiments with fixed fluxes (Dyn-only) and (red) ENSEMBLES

At the end of the first month, the main structures of the SST differences between the coupled models are also present in the ocean-only Dyn-HF simulations. The large cooling at the equator in the INGV model is quite well reproduced both in pattern and amplitude. A slight warming is simulated in the Met Office Dyn-HF experiment, as in the coupled model. The comparison is less conclusive for ECMWF and Météo-France, suggesting that coupled model heat flux errors, which are not included in the forcing, can also play a role in the development of the cold bias in these models.

The SST simulated in the Dyn-only experiments is close to that obtained in the Dyn-HF experiments. This indicates that east Pacific SST errors are mainly due to wind-driven entrainment rather than to the wind-induced latent heat flux, which is seen to act in the west Pacific. These experiments suggest that during the first month of hindcast the wind is responsible for the main SST error patterns via its contribution to the upwelling of cold water from below the thermocline. We verified that the differences between models for individual El Niño events were the same as those for the composite El Niño (not shown), suggesting the robustness of the mechanisms initiating SST differences between models.

We further investigate the difference between the influence of local and remote wind biases. The SST errors at the equator may either be due to local wind errors, through the modulation of the upwelling, or to a misrepresentation of the wind response in the west Pacific leading to a wrong Bjerknes feedback. The wind in the west Pacific is too strong in the Met Office and INGV models (Fig. 11) and the lack of Bjerknes feedback in these models likely contributes to the development of the cold tongue SST bias. However, despite the fact that the zonal wind has the same amplitude in the west Pacific in both Met Office and INGV, the cold SST bias develops only in the latter during the first month. Moreover, a response of the thermocline to zonal wind changes in the western Pacific will take 2 months to reach the east Pacific via a Kelvin wave, which is longer than the time necessary for the cold bias to develop in the INGV model. On the other hand, the Met Office and INGV models respectively show too weak and too strong zonal wind in the central Pacific in May, which is consistent with the respective SST biases developed.

We estimate the cooling for a 0.5 m s−1 enhancement of the upwelling, as is simulated by the INGV model at the equator in the central Pacific (between 180° and 130°W, 1°N and 1°S). The corresponding cooling by anomalous upwelling is estimated by the term \(\langle w \rangle\,\langle H(w) \frac{\partial \overline{T}}{\partial z} \rangle\), where w denotes the anomalous upwelling and \(\frac{\partial \overline{T}}{\partial z}\) the seasonal vertical temperature gradient of the initial state of the simulation. The seasonal vertical temperature gradient estimated from REF in the equatorial central Pacific (same definition as above) at 50 m depth is 0.054 K m−1 and the resulting estimated cooling in the mixed layer after 1 month is 0.7 K. This is consistent with the 0.5 K cooling found in the INGV model.
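The order of magnitude of this estimate can be checked with a simplified version of the advection term, in which a constant anomalous upwelling acts on a fixed vertical gradient for 30 days. This is a sketch under assumptions that are not stated in the text: H(w) is read as a step function retaining only upward motion, the averaging operators are dropped, and the function names are hypothetical. Only the gradient (0.054 K m−1) and the two cooling values (0.7 K and 0.5 K) come from the paragraph above.

```python
DTDZ = 0.054  # K m-1, vertical temperature gradient at 50 m quoted in the text
DAYS = 30     # length of the first month of simulation


def upwelling_cooling(w_m_per_day, dtdz_k_per_m, days):
    """Cooling (K) from a constant anomalous upwelling acting on a fixed gradient."""
    if w_m_per_day <= 0:
        return 0.0  # step function: downwelling anomalies give no cooling here
    return w_m_per_day * dtdz_k_per_m * days


def implied_upwelling(cooling_k, dtdz_k_per_m, days):
    """Constant anomalous upwelling (m per day) that would give the stated cooling."""
    return cooling_k / (dtdz_k_per_m * days)


w_estimate = implied_upwelling(0.7, DTDZ, DAYS)   # from the estimated 0.7 K
w_simulated = implied_upwelling(0.5, DTDZ, DAYS)  # from the 0.5 K found in INGV
```

Under these simplifications, both cooling values correspond to a vertical velocity anomaly of a few tenths of a metre per day sustained over the month.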

This analysis shows that local wind biases in the central and eastern Pacific, present from the beginning of the simulation, are likely to be responsible for the SST biases set during the first month of the simulation. Wind biases in the west Pacific may play some role in the development of cold errors but at longer leadtime, allowing for Kelvin wave propagation time. Moreover, Fig. 5 shows that the zonal wind simulated in Met Office is stronger than in ERA40 in the central and west Pacific in June but not in May. This could explain why no cold bias is produced at a leadtime of 1 month in May in the Met Office hindcasts and Dyn-HF experiment.

5 Summary and discussion

This study analyses the development of tropical Pacific mean state and interannual variability biases in AOGCMs. The ENSEMBLES set of hindcasts, including five coupled models, is analyzed. Four of the five models (IFM-Kiel, Météo-France, Met Office and INGV) develop a cold mean SST bias along the equator after a 5–7 months forecast time, whereas the ECMWF model does not exhibit any SST cold bias in the central equatorial Pacific (Fig. 2). This cold bias is representative of the cold tongue bias which appears in CMIP3 centennial simulations (Guilyardi 2006; AchutaRao and Sperber 2006; Lin 2007). An SST initialized close to observations, and the subsequent transient drift towards the cold equatorial bias, make these simulations an appropriate tool to understand the origin of this persistent bias in AOGCMs. Comparing the models that develop a cold bias with the ECMWF model helps to better understand the mechanism leading to the spurious cooling of the equatorial Pacific.

The SST cold bias in Niño 3 has a strong seasonality which varies among the models (Fig. 9). Some models (IFM, Météo-France, INGV) underestimate the observed Niño 3 SST warming caused by the trade wind relaxation, from January to May. Others (INGV, Met Office) have too strong an SST cooling during the emergence of the upwelling, from June to December. The analysis of the El Niño and La Niña composites reveals that the cold bias does not develop symmetrically during ENSO. The cold SST bias develops mainly during El Niño onset when models are initialized in February and May. In contrast, in the INGV model, the cold SST bias developing in the first part of the year is stronger during La Niña decay than during El Niño, as INGV fails to recover the seasonal cycle.

The cold bias developing in February and May shares similarities among the models: it occurs at several consecutive start dates (February and May), and is stronger during El Niño onset than during La Niña. We focus the analysis on the cold bias in May since the bias develops rapidly after the initialization; its appearance during the start of the upwelling suggests a dynamical adjustment to overly strong trade winds.

We identify two types of zonal wind errors in the hindcasts that are initialized in May during El Niño onset. The first type develops in the eastern and central part of the basin, is present from the beginning of the simulation and corresponds to mean state biases (Fig. 11). These errors are significant from the first week (Fig. 13), although during that time no significant bias is observed in SST. The second type develops in the western part of the basin (Fig. 11), increases with leadtime and is shown to be a response to the developing SST bias (Fig. 9). Moreover, models developing a strong cold bias when initialized in February and May are those which have a shallow mixed layer (Fig. 12). While the zonal wind errors in the east of the basin are shown to be responsible for the cold bias, mixed layer depth errors are likely to amplify it.

This precursor role of east Pacific zonal wind errors is further confirmed by carrying out ocean-only experiments. Applying the coupled models' surface wind to the same OGCM is enough to reproduce the SST biases seen in several ENSEMBLES AOGCMs (INGV, Met Office, Météo-France, Fig. 14). It is further shown that the cold SST bias is mainly due to the dynamical impact of the wind, while heat flux errors may also play a role in some models. The experimental protocol using 30-day-long ocean-only experiments is appropriate only if the wind did not have time to adjust to the SST biases. Indeed, in order to argue for a causal link between the zonal wind bias and the SST bias, it is necessary to ensure that the SST did not have time to feed back on and enhance the zonal wind errors. Very few studies investigate the time scale of such a basin-scale response. We speculate that it is the sum of the time needed to develop a significant SST anomaly (15 days as seen above), the time needed for the SST to warm the atmosphere above (a few days maximum) and the characteristic time of the Gill-Matsuno response of the large scale circulation to an anomalous diabatic heating in the atmosphere (1 week according to Jin and Hoskins 1995). Even though more work is needed to fully conclude, we consequently argue that it takes at least 3 weeks for the zonal wind to be modified by the SST errors.
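The 3-week lower bound follows from adding the three time scales listed above. Writing them as \(\tau_{\mathrm{SST}}\), \(\tau_{\mathrm{atm}}\) and \(\tau_{\mathrm{Gill}}\) (a notation introduced here for illustration only), and taking the atmospheric warming time as negligible to obtain the shortest possible response:

\[
\tau_{\mathrm{wind}} \approx \tau_{\mathrm{SST}} + \tau_{\mathrm{atm}} + \tau_{\mathrm{Gill}} \gtrsim 15 + 0 + 7 = 22\ \text{days} \approx 3\ \text{weeks}.
\]

Allowing a few days for \(\tau_{\mathrm{atm}}\) brings the total closer to 25 days, which remains shorter than the 30-day experiments by only a small margin.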

Key mechanisms explaining the advent of biases in models occur during the transition from the initial state to the stabilized mean state. This transition phase is often not analysed in long model integrations. Hindcasts offer a powerful way to understand the cause of model drift, to distinguish preexisting biases from evolving ones and finally to establish a chronology of error development.

Results presented here also suggest a role for heat fluxes in generating SST biases in the equatorial Pacific, especially for interannual variations. The intensification of the Walker cell associated with the cold equatorial bias leads to modified heat flux feedbacks, especially in the east, where a permanent subsidence regime does not allow interannual SST anomalies to be damped by the convective regime as they are in observations (Fig. 6, Lloyd et al. 2011). Intriguingly, the intensity of the shortwave heat flux feedback in the east Pacific is set after only 1 or 2 months even though the cold SST bias continues to develop for several months. More work is needed to understand the origin of the zonal wind bias in models. Lin (2007) has for instance shown that excessive precipitation exists in atmosphere-only simulations and argues that these intrinsic AGCM biases may be responsible for the overly strong trade winds developing in coupled models. Hence, understanding the dynamical and thermodynamical mechanisms that drive the tropical atmosphere is required both to alleviate AOGCM errors and to describe the full extent of the role of the atmosphere in tropical variability, such as ENSO. For that purpose, short AGCM-only simulations could be performed in the same way as the ocean-only experiments done in this paper, to see whether and how wind biases can develop in atmospheric models.