1 Introduction

From the icy, wet winters along the Great Lakes to the hot, dry summers in the Southwest, the United States (U.S.) experiences a wide range of climatic extremes. These extremes create unique challenges when forecasting the weather. Understanding forecast errors across such a diverse landscape is equally challenging, requiring multi-dimensional visualizations across space, time, and climate measurements. A better understanding of the nature and patterns of forecast errors across the U.S. helps meteorologists as they strive to improve weather forecasts. It can also help everyday Americans know how much faith to put in the weather forecast on the day of an important event.

The 2018 Data Expo of the Sections on Statistical Computing and Statistical Graphics of the American Statistical Association (ASA) provided an opportunity to explore and compare weather forecast errors across the U.S. Our analysis focused on the question:

         How do weather forecast errors differ across regions of the U.S.?

This motivating question prompted the subsequent questions:

  • Do U.S. weather stations cluster into regions based on weather characteristics?

  • How do error variables correlate and do these correlations change by region?

  • How do forecast errors change by region and by season?

  • Where are the best and worst forecast accuracies?

  • Which variables are important in determining forecast errors?

Preliminary results of our analysis are published in the proceedings for the 2018 Joint Statistical Meetings (Lundell et al. 2018).

This article is devoted to answering these questions. We use ensemble graphics to create an overall picture of weather forecast errors across different regions of the U.S. (Unwin and Valero-Mora 2018). Ensemble graphics enhance traditional analyses by connecting several visualizations of the data with adjoining text. This presentation tells a cohesive story of the data more effectively than a few disjointed graphics could. In Sect. 2, we summarize the data and then show that the U.S. can be clustered into six well-defined weather regions using the provided climate measurements, elevation, and distance to coast. These clusters, or weather regions, form the basis of our comparison of forecast accuracy across the U.S. through a series of multi-dimensional plots and variable importance analyses described in Sect. 3. In Sect. 4, we introduce the interactive application we created to enhance our data explorations. We conclude in Sect. 5 that the climate differences that distinguish the weather regions of the U.S. also create region-specific patterns and differences in forecast accuracy. Two appendices at the end of this paper explain the data cleaning and the construction of the glyphs used in this article.

2 Weather regions


The data contain measurements and forecasts for 113 U.S. weather stations from July 2014 to September 2017. Information about the data and other analyses done with the data can be found in Cetinkaya-Rundel and Martinez (2023). These data can be obtained from our supplemental materials or at the following URL:

https://community.amstat.org/jointscsg-section/dataexpo/dataexpo2018.


Daily measurements for eight different weather metrics were recorded for each location: temperature, precipitation, dew point, humidity, sea level pressure, wind speed, cloud cover, and visibility. Notable weather events, such as thunderstorms and fog, are also recorded as text. The daily minimum, maximum, and mean were recorded for each metric. Weather characteristics used in this article are listed in Table 1. Data were supplemented with some geographic information and carefully examined and cleaned. Details on data cleaning, obtaining additional data, and the justification behind our final variable selection are found in Appendix A.

Table 1 Weather variables included in our analysis. All observations outside the indicated ranges were removed prior to our analysis

2.1 Developing weather clusters

The U.S. has been divided into regions based on environmental characteristics such as watersheds and climate (Commission for Environmental Cooperation 1997; Briggs et al. 2003). We examined existing environmental regions but were unable to find a set that made sense for weather in the context of this analysis. We therefore created our own weather regions by clustering the weather stations based on the metrics in Table 1. Thus, clusters are defined by the weather characteristics observed at each station. We use these clusters to determine how weather forecast error patterns are related to the unique climate measurements of a particular region. A review of existing weather regions and how they correspond to our weather regions is given in Sect. 2.2. Data were aggregated by taking the mean and standard deviation of each variable in Table 1 for each of the 113 weather stations over the period of record.

Hierarchical clustering (Hastie et al. 2001, pp. 520–526) with Euclidean distance and Ward's minimum variance method (Murtagh and Legendre 2014) was used to identify clusters. The clusters were examined spatially to assess the performance of the clustering method and to select the final number of clusters. We wanted to ensure the weather station clusters were of a sufficient size to be practical. Five clusters produced one cluster stretching from the Midwest to the East Coast, which we judged too large given the differences between coastal and inland climates. Seven clusters produced a cluster containing only five weather stations, which is too small. Thus, we chose six clusters to divide the U.S. into weather regions.
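
A minimal sketch of this step in base R is given below; station_summaries is our illustrative name for the 113-row table of per-station means and standard deviations, with random data standing in for the real summaries.

```r
# Sketch of the clustering step. Random data stands in for the real
# per-station means and standard deviations of the Table 1 variables.
set.seed(2018)
station_summaries <- as.data.frame(matrix(rnorm(113 * 16), nrow = 113))

# Standardize so no single variable dominates the Euclidean distances
scaled <- scale(station_summaries)

# Ward's minimum variance method; "ward.D2" squares the dissimilarities
# internally, consistent with Murtagh and Legendre (2014)
hc <- hclust(dist(scaled, method = "euclidean"), method = "ward.D2")

# Cut the dendrogram into six weather regions and inspect cluster sizes
region <- cutree(hc, k = 6)
table(region)
plot(hc, labels = FALSE, main = "Weather station dendrogram")
```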

Figures 1 and 2 show the results of the cluster analysis. Figure 3 shows a parallel coordinate plot of the characteristics of each weather region: the Z-scores of the mean and standard deviation of each variable in Table 1 were computed and plotted. It is difficult to distinguish the six weather regions from each other in a static plot, so an interactive app was created that provides a better view of the features of each cluster. The app is discussed in Sect. 4.
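
A plot in the spirit of Fig. 3 can be sketched with the ggparcoord() function from the GGally package (our choice of tooling rather than the article's documented implementation); it reuses the illustrative objects from the clustering sketch above, with scale = "std" plotting Z-scores.

```r
# Parallel coordinate sketch reusing station_summaries and region from above
library(GGally)  # loads ggplot2; ggparcoord() draws parallel coordinates

pc_data <- data.frame(region = factor(region), station_summaries)
ggparcoord(pc_data,
           columns     = 2:ncol(pc_data),  # the weather summary variables
           groupColumn = "region",         # color lines by weather region
           scale       = "std")            # plot Z-scores, as in the article
```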

The names and characteristics of each weather cluster are as follows:

  • Cali-Florida (13 stations): Warm and humid with high dew point and pressure. Low variability in almost all measurements.

  • Southeast (22 stations): Warm and humid with lots of rain. High variability in precipitation and low variability in temperature.

  • Northeast (39 stations): Cold, humid, and low visibility. High variability in temperature, dew point, and pressure.

  • Intermountain West (19 stations): Cold and dry, with high variability in temperature, wind speed, and pressure. Low variability in precipitation and dew point.

  • Midwest (13 stations): Landlocked with high wind speed and high variability in temperature, pressure, and wind speed.

  • Southwest (7 stations): Warm, sunny, and dry with little variation in temperature or precipitation. High variability in wind speed and humidity.

Fig. 1 Map of the six weather regions. The color band at the bottom identifies each region by name and color

Fig. 2 Dendrogram of weather clusters identified in Fig. 1

Fig. 3 Parallel coordinate plot of the means and standard deviations of the weather variables listed in Table 1. Each line in the plot represents one of the 113 weather stations. The color of each line matches the weather region to which the station belongs. An interactive app is available that allows for better identification of regional trends. The Southwest region is highlighted in this graph to emphasize its weather characteristics

2.2 Comparison to existing climate regions

Ecological and climate regions have been developed for the U.S. in other studies. Many of these studies focused on smaller regions of the U.S., but a few have considered the U.S. as a whole. Clustering methods and the variables used to identify clusters differ from study to study. The ecological regions of North America (Commission for Environmental Cooperation 1997) were built around ecosystems: air, water, land, and biota, including humans, were used to create the ecoregions. These ecoregions show a strong longitudinal trend that corresponds well with the longitudinal trends in our clusters. They were not determined by statistical clustering methods, but by careful assessment of ecological properties across North America.

The National Oceanic and Atmospheric Administration (NOAA) developed climate regions that incorporate seasonal temperature and precipitation information (Karl and Koss 1984). These regions differ substantially from the North American ecological regions: they show a latitudinal trend in addition to the longitudinal trend and are constrained by state boundaries. Spectral curves assessing drought and wet spells were used to define the NOAA regions (Diaz 1983). The NOAA regions correspond roughly to our general weather regions despite their borders being defined by state boundaries. The north/south division in the eastern U.S. closely aligns with our cluster division in that area. The major east/west division in our clusters is in a similar location to the NOAA clusters as well.

The International Energy Conservation Code (IECC) climate clustering of the U.S. (Briggs et al. 2003) and its subsequent reclassification by Hathaway et al. (2013) divided the U.S. into fourteen regions based on temperature, dew point, wind speed, and radiation. Clustering methods included K-means clustering and Monte-Carlo sifting, a method developed by Hathaway et al. (2013) that identifies candidate and dropout reference cells by iteratively searching through a set of Monte-Carlo runs. Both sets of regions show a strong latitudinal trend in the eastern U.S. These regions also show distinct separation of the West Coast and Southwest deserts from the rest of the western U.S. Similar trends are seen in our clusters. The latitudinal trend in the eastern U.S. is not as strong in our clusters, but this is likely because we chose a smaller number of weather clusters. The inclusion of additional variables insensitive to latitudinal trends, such as distance to coast, elevation, and humidity, also serves to reduce the latitudinal separation in our clusters.

One key difference between our weather regions and the regions seen in other studies is that we combine Florida and the Pacific coast into a single weather region. This is likely a result of our choice to omit the geographic proximity of weather stations from the cluster analysis and consider only similarities in weather patterns. Both Florida and the Pacific coast experience less seasonality in their weather patterns than the rest of the country. This results in smaller than average standard deviations for many of the climate variables in both of these regions. These small standard deviations create a measure of closeness between Florida and the Pacific coast, which likely explains why these two geographic areas fall into a single cluster when working with six or fewer clusters. The Florida and Pacific stations split into separate clusters when using seven clusters, with the exception of two stations from the Pacific coast that cluster with the Florida stations. Hawaii and Alaska are either ignored in the literature or placed in their own regions. Because we did not use spatial proximity as a clustering variable and we assigned all weather stations to one of our six weather clusters, Hawaii and Alaska are clustered with Cali-Florida and the Northeast, respectively. Our clusters show that weather patterns typically have strong spatial correlations, with temperate coastal regions being a notable exception.

3 Forecast error explorations

Given the clear separation of the country into distinct weather regions, we seek to determine if there are clear differences in forecast error patterns among the regions. Forecasts were restricted to minimum temperature, maximum temperature, and the probability of precipitation. The forecast error for minimum and maximum temperature is calculated as the absolute difference between forecast and measurement. The forecast error for precipitation is measured using the Brier Skill Score (BSS), a well-known measure of probabilistic forecast accuracy (Weigel et al. 2007). It is defined for a particular weather station as

$$\text{BSS} = 1 - \frac{\sum_{i=1}^{N}\sum_{j=0}^{M}\left(Y_{ij} - O_{i}\right)^{2}}{\sum_{i=1}^{N}\sum_{j=0}^{M}\left(P - O_{i}\right)^{2}} \tag{1}$$

where \(Y_{ij} \in [0, 1]\) is the predicted probability of rain on day i with forecast lag j; \(O_{i} \in \{0, 1\}\) is a binary variable with value 1 if any precipitation fell during the day and 0 otherwise (we define a precipitation event as a positive precipitation measurement or the inclusion of the words “rain” or “snow” in the event information); \(P \in [0, 1]\) is the average daily chance of precipitation over the period of interest, defined as \(P = \frac{1}{N}\sum_{i=1}^{N} O_i\); N denotes the number of days of recorded precipitation in the period of record; and \(M \in \{0, \ldots, 5\}\) denotes the maximum forecast lag.

Note that \(\text{BSS} \in (-\infty, 1]\), with 1 indicating perfect forecast skill and movement towards \(-\infty\) indicating worse forecasts. We chose to use \(1 - \text{BSS}\) so that all three error variables are consistent in orientation: larger values always indicate worse forecasts. The following subsections explore differences in forecast errors both between and within the previously defined weather regions visualized in Fig. 1. Forecast errors are averaged over lag, and in some cases over month, in each graph. The visualizations in the following subsections confirm our hypothesis that different weather regions experience distinctly different weather forecast error patterns.
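
As a concrete sketch of Eq. (1), the function below computes the BSS for one station from a table with one row per day-lag pair; the column names are ours, and we assume each day contributes the same number of lags so the row mean of the outcomes equals P.

```r
# BSS for one station; `prob` holds the Y_ij and `rained` the matching O_i,
# one row per day-lag pair (illustrative names)
brier_skill_score <- function(prob, rained) {
  p_bar <- mean(rained)  # equals P when every day has the same number of lags
  1 - sum((prob - rained)^2) / sum((p_bar - rained)^2)
}

# Orient the error like the temperature errors: larger = worse
set.seed(1)
fc <- data.frame(prob = runif(100), rained = rbinom(100, 1, 0.3))
precip_error <- 1 - brier_skill_score(fc$prob, fc$rained)
```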

3.1 Error correlations

Are the forecast errors for the three different measurements (i.e., minimum temperature, maximum temperature, and precipitation) correlated with each other? How do these relationships change between the different weather regions? We explore such correlations through the use of correlation ellipses (Murdoch and Chow 1996) superimposed on a map of the U.S. in Fig. 4. We calculated Spearman correlations between each pair of measurements for the locations within each cluster. The sign of the correlation coefficient is denoted by the slope of the ellipse and the strength of correlation is denoted by the width of the ellipse.
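
The per-region correlation computation can be sketched as follows; the object and column names are illustrative, and random data stands in for the station-level errors.

```r
# Station-level forecast errors with a region label (random stand-in data)
set.seed(4)
errors <- data.frame(
  region       = factor(sample(1:6, 113, replace = TRUE)),
  min_temp_err = runif(113),
  max_temp_err = runif(113),
  precip_err   = runif(113)
)

# Spearman correlation matrix among the three errors within each region
by_region <- lapply(split(errors[-1], errors$region),
                    function(d) cor(d, method = "spearman"))

# Significance of one pairwise correlation within one region
with(subset(errors, region == 1),
     cor.test(min_temp_err, precip_err, method = "spearman"))
```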

All of the correlations between error variables are positive except for the correlations between minimum temperature and the other two variables in the Northeast. The strongest relationships are seen in the Midwest, the Southeast, and the Southwest. The weakest relationships are found in the Northeast. Only a few cluster-specific correlations are significant, likely due to the small number of stations in many of the weather regions. However, the overall correlations for the 113 weather stations are all positive and significant. This indicates that areas with good predictions for one forecast variable generally have good predictions for the other forecast variables as well. The weakest correlations are between minimum temperature and precipitation predictions. Although there are relationships between the three weather forecast variables, those relationships are not particularly strong and their strength differs within each region. The observations made using this correlation ellipse map illustrate how this plot style facilitates multi-dimensional comparisons across space. Information on the calculations and implementation of the correlation glyphs can be found in Appendix B.

Fig. 4 Spearman correlations between forecast error variables represented as ellipses superimposed on a map of the United States. The p value for each correlation is compared against a 0.05 level of significance

3.2 Error scatterplots

Scatterplots reveal outliers and overall trends within weather regions and across forecast lag. Forecast lag is defined as the number of days between the day of forecast and the day being forecast. Thus, same-day forecasts have a lag of 0, forecasts made one day prior a lag of 1, and so on. Because we are comparing three variables spatially and temporally across the U.S., static graphs are not optimal for assessing all relationships of interest. We constructed an interactive scatterplot app using Shiny (Chang et al. 2019) that facilitates examination of trends between the three forecast error variables, aggregated across all forecast lags or for individual forecast lags. Figure 5(a–c) shows examples of plots from the interactive app: the scatterplot for the data aggregated over all forecast lags, as well as the scatterplots for lags of 5, 3, and 1, to illustrate how forecast accuracy changes over forecast lag.
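
The aggregation behind these scatterplots can be sketched with dplyr (our choice of tooling, with illustrative names and random data):

```r
library(dplyr)

# One absolute error per station, lag, and day (random stand-in data)
set.seed(5)
daily_errors <- expand.grid(station = paste0("S", 1:113),
                            lag = 0:5, day = 1:30)
daily_errors$abs_err <- runif(nrow(daily_errors), 0, 6)

# Average over days within each station-lag combination; dropping `lag`
# from group_by() gives the aggregation over all lags instead
lag_summary <- daily_errors %>%
  group_by(station, lag) %>%
  summarize(mean_abs_err = mean(abs_err), .groups = "drop")
```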

Fig. 5 Scatterplots comparing the three forecast error variables. The scatterplot to the left of the map is aggregated over all forecast lags. Points of interest discussed in the text are highlighted in the plots

Figure 5a compares minimum temperature forecast accuracy with precipitation accuracy. Weather stations with the worst predictions of minimum temperature are located in New England and the Intermountain West. New England is known for extreme winter weather, and the frequency of extreme weather events seems to be increasing (Cohen et al. 2018). This likely contributes to the struggle these stations have in predicting minimum temperature. The worst predictor of minimum temperature is Austin, Nevada. This location is addressed further in Fig. 5c. Cali-Florida uniformly has the best predictions of minimum temperature. However, Cali-Florida also has some of the greatest variability in precipitation prediction accuracy when examining individual lags.

Figure 5b compares maximum temperature prediction accuracy with precipitation accuracy. Four weather stations in the Great Lakes region have the worst precipitation predictions in the dataset. Poor precipitation forecast accuracy in this region illustrates the difficulty of forecasting lake-effect snow. This phenomenon is discussed in greater depth in Sect. 3.3. Precipitation forecast accuracy for the Great Lakes region improves substantially as the forecast lag decreases, and forecasts with lag 1 are as accurate as those in the rest of the nation.

Figure 5c shows the relationship between minimum and maximum temperature forecast accuracy. Three outliers stand out in these scatterplots: Key West, Florida; Austin, Nevada; and San Francisco, California. Key West predicts both minimum and maximum temperature more accurately than any other weather station. Key West also ranks in the top five for lowest variability in eight of the weather variables, which likely explains the accurate forecasts. Austin is the poorest predictor of both measures. Seventy miles along the “loneliest highway in America” (The Greater Austin Chamber of Commerce 2018) separate Austin from its weather measurements, which were collected in Eureka, Nevada. The poor predictions for maximum and minimum temperature can be explained by the change in climate over such a large distance. This is reflected in a negative prediction bias of around \(5^{\circ}\)F for maximum temperature and a positive bias of around \(7^{\circ}\)F for minimum temperature. San Francisco has good predictions of minimum temperature and poor predictions of maximum temperature. This phenomenon is further explained in Sect. 3.3.

The interactive app developed in conjunction with this project allows for further investigation of forecast accuracy trends. The app is discussed in Sect. 4.

3.3 Seasonal trends

The position of the U.S. in the northern hemisphere makes most of the country subject to distinct weather seasons. Seasons are most pronounced in the northern U.S. We hypothesize that forecast error behavior is inextricably linked to this seasonality. We explore this through a series of space-time graphs. Modeling space and time simultaneously creates a three-dimensional problem usually visualized as small multiples. Small multiples are “a series of graphics, showing the same combination of variables [e.g., latitude and longitude], indexed by changes in another variable [e.g., time]” (Tufte 2002, p. 170). The issue with this approach is that it becomes difficult to visually comprehend all but the most drastic changes from graph to graph. One alternative that allows simultaneous visualization of both space and time is the use of glyphs, or symbols, that permit multi-dimensional visualizations in a spatial context (Carr et al. 1992; Wickham et al. 2012).

Figure 6 shows glyph plots of seasonal forecast errors throughout time. The forecast error is visualized as the scaled distance from a center point to the edge of a polygon with twelve observations starting with January at the 12:00 position and proceeding clockwise. The asymmetry of the glyphs about their center points illustrates how forecast errors change across time and across space. For example, locations in the Northeast are worse at forecasting precipitation in the winter than in the summer, while locations in the Southeast forecast precipitation equally well throughout the year.
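
One glyph of this kind can be approximated in ggplot2 as sketched below; this is our rough rendering of the geometry for a single station and error variable, not the article's actual construction, which is detailed in Appendix B.

```r
library(ggplot2)

# Twelve monthly mean errors for one station (random stand-in data)
set.seed(6)
glyph <- data.frame(month = 1:12, err = runif(12, 0.2, 1))

# Map error to radial distance; rotate so January sits near 12:00 and
# months proceed clockwise around the glyph
ggplot(glyph, aes(x = month, y = err)) +
  geom_polygon(fill = "steelblue", alpha = 0.6) +
  coord_polar(theta = "x", start = -pi / 12) +
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  theme_minimal()
```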

Fig. 6 Glyph plots of weather forecast accuracy averaged by month. The error is represented as the scaled distance from a center point to the edge of a polygon beginning with January at the 12:00 position and proceeding clockwise

In addition to highlighting forecasting asymmetries, Fig. 6 reveals location-specific anomalies. For example, San Francisco, California, predicts minimum temperatures well all year, but only predicts maximum temperatures well in the winter months. This is likely due to chilling coastal fogs known to frequent the region throughout the year that can create sharp temperature differences over short distances (Nolte 2016). The struggle to predict maximum temperature seems reasonable in light of these facts, as the measurement location is more than 11 miles inland from the forecast location. The issue is likely less pronounced in the winter because the contrast between inland and coastal temperatures is reduced.

Maximum temperature predictions are particularly poor in the summer months in Austin, Nevada. It is unclear why predictions are worse in the summer than in the winter.

Another location-specific anomaly of note is the drastic seasonality of precipitation forecasts for locations surrounding the Great Lakes, as observed in Fig. 6. The error scatterplots in Fig. 5b show that precipitation accuracy is poor in this region, but the seasonality of the predictions cannot be observed in the scatterplots. The unusually bad forecasting in the winter is likely due to lake-effect snow which is prevalent in the region. Up to 100% more snow falls downwind of Lake Superior in the winter than would be expected without the lake-effect (Scott and Huff 1996). This area has been previously identified as having the most unpredictable precipitation patterns in the nation (Silver and Fischer-Baum 2014). The above examples demonstrate the ease with which comparisons can be made across space and time with these glyph-based plots. Information about how to generate the glyphs is included in Appendix B.

3.4 Variable importance

The differences in forecast error patterns across regions prompt identification of the climate measurements most important for predicting forecast error. We used random forests (Breiman 2001) to determine which weather variables had the greatest impact on the forecast errors. The data were aggregated over forecast lag and month. Three random forest models were generated for each weather region, one per forecast error variable used as the response. The means and standard deviations of the weather variables listed in Table 1 and the forecast lag were the predictor variables. Figure 7 contains three parallel coordinate plots that show the variable importance measures in each region for each forecast error variable. The importance measures obtained from the random forests were recentered by subtracting the minimum importance measure, then rescaled to the interval [0, 100] by dividing by the maximum of the recentered values for each weather cluster and forecast error variable combination and multiplying by 100. Thus, for each error measure, the most important variable within each weather region has a value of 100 and the least important has a value of 0. This allows direct comparisons of importance between weather regions and across error measures.
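
The fit and rescaling for one region and one response might look like the following sketch, assuming the randomForest package and illustrative data.

```r
library(randomForest)

# One region's data: the response plus ten stand-in predictors
set.seed(7)
region_data <- data.frame(max_temp_err = runif(50),
                          matrix(rnorm(50 * 10), nrow = 50))

rf <- randomForest(max_temp_err ~ ., data = region_data, importance = TRUE)
imp <- importance(rf, type = 1)[, 1]  # permutation importance (%IncMSE)

# Recenter at zero, then rescale so the most important variable scores 100
imp_scaled <- 100 * (imp - min(imp)) / max(imp - min(imp))
sort(imp_scaled, decreasing = TRUE)
```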

Figure 7 shows that the most important variable for the precipitation error is forecast lag regardless of weather region. None of the other variables are very important relative to lag. The Southeast shows minimum dew point (DP) and the standard deviation of maximum dew point as being somewhat important. Cloud cover is important for the precipitation error in the Northeast.

Forecast lag is also the most important variable for the maximum temperature error for all weather regions except Cali-Florida. The standard deviation of maximum temperature and maximum wind speed (WS) are more important than lag in Cali-Florida. The variability in maximum temperature is also important for the Southeast, Northeast, and the Intermountain West. Distance to coast (Dist2Coast) and elevation are important for the maximum temperature error in the Intermountain West.

The variables that are important for the minimum temperature error vary substantially across weather regions. The variability in minimum temperature is important for all regions, but other important variables differ widely from region to region. Minimum temperature is the most important for the Northeast and the Intermountain West, while maximum temperature is important for the Southeast. Minimum dew point and the variability in maximum sea level pressure (SLP) are important in the Southwest, while the variability in minimum sea level pressure is the most important for the Midwest, Southeast, and Southwest. Forecast lag is not particularly important for any of the regions except the Midwest.

Fig. 7 Variable importance for each of the three forecast accuracy measurements. Variable importance measures have been rescaled to make the measures directly comparable between weather regions and accuracy measures

4 Interactive application

It is difficult to identify the patterns in climate measurements and forecast errors for all weather regions with static visualizations. We developed an interactive Shiny app to enhance our weather data explorations. This app can be accessed at

      https://jilllundell.shinyapps.io/finaldataexpoapp/.

The first tab of the app is an interactive version of the parallel coordinate plot introduced in Fig. 3. The app allows the user to select a weather region which is highlighted on the graph. Characteristics of the selected region can be easily seen and compared to all other observations.

The second tab of the app is an interactive scatterplot. Figure 5(a–c) shows examples of the graphs generated in this tab. The user can select up to two of the three forecast error variables to place on the axes. The forecast lag can also be selected. Points on the scatterplot can be brushed or clicked, and the selected points appear on a map of the U.S. Information about the selected stations is listed in a table under the graph. The idea of linked brushing between scatterplots and maps was first introduced in Monmonier (1989). This app allows for a more complete exploration of outliers and trends in the data across forecast lags and between error variables than a static graph.
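
A stripped-down sketch of the linked-brushing mechanism (not the published app; the data and names are illustrative) shows the core Shiny pieces: a brushable plot and a table of the brushed points.

```r
library(shiny)

# Station-level errors to brush (random stand-in data)
set.seed(8)
stations <- data.frame(station      = paste0("S", 1:113),
                       min_temp_err = runif(113, 1, 6),
                       precip_err   = runif(113))

ui <- fluidPage(
  plotOutput("scatter", brush = "sel"),  # enable rectangular brushing
  tableOutput("picked")                  # table linked to the brush
)

server <- function(input, output) {
  output$scatter <- renderPlot(
    plot(stations$min_temp_err, stations$precip_err,
         xlab = "Minimum temperature error", ylab = "Precipitation error"))
  # brushedPoints() filters the data frame to the brushed region; xvar/yvar
  # are required here because the plot is base graphics, not ggplot2
  output$picked <- renderTable(
    brushedPoints(stations, input$sel,
                  xvar = "min_temp_err", yvar = "precip_err"))
}

shinyApp(ui, server)
```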

5 Conclusions

Climate patterns in the United States cleanly separate into six recognizable regions through a cluster analysis using the means and standard deviations of the weather variables provided in Table 1. We explored the relationship between the three weather forecast variables (i.e., minimum temperature, maximum temperature, and precipitation) using correlation ellipses shown in Fig. 4. We found that all clusters show signs of positive correlations among the error variables with the exception of the Northeast cluster.

We visualized the pairwise relationships between forecast errors through a series of scatterplots across all forecast lags in Fig. 5. These plots highlight the superiority of locations in the Cali-Florida region for predicting minimum temperature across all lags, and also show that the poor precipitation predictions of the Great Lakes region are mostly confined to forecasts with lags greater than 2. Lastly, the abnormally high errors in Austin, Nevada, are likely a product of the large distance between the forecast and measurement locations.

We explored seasonal differences in forecast errors in Fig. 6 and observed that seasonal differences in forecast errors tend to be more pronounced in northern, inland clusters than in southern clusters. We also showed that location-specific anomalies, such as the asymmetry in seasonal maximum temperature forecast errors in San Francisco and the precipitation forecast errors near the Great Lakes, have plausible explanations in the literature.

Next, we compared the important variables in determining forecast errors across clusters using scaled random forest variable importance measures in Fig. 7. These measures demonstrate that forecast lag is most important in determining the maximum temperature and the precipitation forecast errors, but not important in predicting the minimum temperature forecast errors. Many clusters place similar importance on a few variables, but there are some variables that are important only in a single cluster, such as the importance of maximum wind speed in predicting the maximum temperature forecast error in Cali-Florida.

For further insight regarding the nature of forecast errors across these six clusters, we refer readers to our R Shiny app described in the previous section. A current version of the app can be found at the following URL:

      https://jilllundell.shinyapps.io/finaldataexpoapp/.

This app, in conjunction with the visualizations presented in this article, reinforces the idea that the U.S. cleanly clusters into well-defined weather regions and that patterns in forecast errors are closely related to the unique climates that characterize each region.

The visualizations in this paper, both interactive and static, were designed to be scalable to larger weather datasets. We anticipate illustrating this capability on an expanded set of stations in the future. An expanded analysis will also serve to validate the regional patterns observed and described in this paper. In addition, we anticipate adapting several of the static glyph plots presented in this paper for interactive use. Greater interactivity will allow for more detailed explorations of weather patterns in the United States across both time and space.