Friday, January 6, 2012

Benchmarking and Assessment applied to USHCN

Firstly, an up-front caveat: I am third (last) author on this paper.

Today the Journal of Geophysical Research has published a paper that applies the benchmarking and assessment principles of the surface temperature initiative to the USHCN dataset of land surface air temperatures.
Williams, C. N., Jr., M. J. Menne, and P. Thorne
Benchmarking the performance of pairwise homogenization of surface temperatures in the United States
J. Geophys. Res., VOL. 117, D05116, 16 PP. doi:10.1029/2011JD016761 (behind a paywall - sorry) (URL and details updated 10/31)

Update 1/19/12: NCDC have provided a version commensurate with AGU copyright guidance (and copyright AGU); the link was updated 10/31/12 following the ftp server move. The paper is also highlighted on their "what's new" page.

The analysis takes the pairwise homogenization algorithm used to create the GHCN and USHCN products and does two things.

Firstly, it identifies a large number of decision points within the algorithm that do not have an absolute basis and allows these to vary. A good climate science analogy here is the perturbed climate model runs of the project and other similar projects. These decision points were varied by random seeding of values to create a 100-member ensemble of solutions. This at least starts to explore the parametric uncertainties within the algorithm (and any interdependencies) and their implications for our understanding of the observed temperature record evolution. However, what it does not necessarily do is give us any better idea of what the true climate evolution may have been. Which brings us on to the second innovation and the focus of this post ...
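As a rough illustration of the perturbed-parameter idea, the sketch below draws a 100-member ensemble of settings by random seeding. The parameter names and candidate values are hypothetical placeholders chosen for illustration; they are not the decision points or ranges used in the paper.

```python
import random

# Hypothetical decision points and candidate values. These names and numbers
# are illustrative assumptions, not the settings varied in the paper.
PARAMETER_CHOICES = {
    "significance_level": [0.01, 0.05, 0.10],
    "min_neighbors": [5, 7, 9],
    "min_segment_years": [2, 3, 5],
}


def draw_ensemble(n_members=100, seed=0):
    """Return n_members parameter settings, each drawn at random."""
    rng = random.Random(seed)  # fixed seed makes the ensemble reproducible
    return [
        {name: rng.choice(options) for name, options in PARAMETER_CHOICES.items()}
        for _ in range(n_members)
    ]


ensemble = draw_ensemble()
```

Each member would then be passed to one run of the homogenization algorithm, so the spread across members reflects only the effect of those choices.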

Secondly, in addition to running on the observations, these ensembles were run on a set of eight analogs to the USHCN network. These consisted of sets of data which directly mimicked the observational availability of the USHCN network itself through time. They were based upon climate model runs from a range of models and a range of forcing scenarios. This ensures some 'plausible' spatio-temporal coherency to the large-scale temperature fields. On top of these were superimposed additional differences to mimic potential random and systematic influences of instrumental and operational artifacts. A set of distinct possibilities was explored across the eight worlds, ranging from no systematic biases at all (highly improbable but a useful 'algorithm does no harm' test) through to a scenario where the network was bedevilled with very many, largely small breaks with a sign-bias tendency - a situation which any algorithm would find hard to cope with. Unlike the real world, these analogs afford the luxury of knowing the true answer, so it is possible to actually benchmark and understand the fundamental algorithm performance and any limitations. It is then possible to re-evaluate the real-world results afresh with the insights gleaned from such realistic test-beds.
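To make the analog idea concrete, here is a minimal single-station sketch: a known "true" series with step changes (breaks) superimposed, optionally with a sign bias. Everything in it is an assumption for illustration. The real analogs were built from climate model output and mimicked the full network, and the function name, noise levels and break sizes below are invented.

```python
import random


def make_analog(n_years=100, trend_per_year=0.01, n_breaks=4,
                break_sd=0.5, sign_bias=0.0, seed=0):
    """Return (truth, observed) annual series for one synthetic station."""
    rng = random.Random(seed)
    # Stand-in for model output: a linear trend plus year-to-year noise.
    truth = [trend_per_year * y + rng.gauss(0.0, 0.2) for y in range(n_years)]
    # Pick the years in which a break occurs (never year 0).
    break_years = sorted(rng.sample(range(1, n_years), n_breaks))
    # Each break is a step change; a non-zero sign_bias shifts the mean step.
    shifts = {y: rng.gauss(sign_bias, break_sd) for y in break_years}
    observed, offset = [], 0.0
    for y in range(n_years):
        offset += shifts.get(y, 0.0)  # steps persist from the break year onward
        observed.append(truth[y] + offset)
    return truth, observed


# A world with a tendency towards negative breaks, and a "no breaks" world.
truth, observed = make_analog(sign_bias=-0.3)
clean_truth, clean_observed = make_analog(n_breaks=0)
```

Because `truth` is returned alongside `observed`, any adjustment applied to `observed` can be scored against the known answer, which is the point of the exercise.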

The results from the analogs were broadly encouraging. First and foremost, when applied to the data with no breaks added, virtually no adjustments were made and the impact on large-scale averaged timeseries and trends was so minuscule that a magnifying glass would be required to tell the difference. So, in the implausible eventuality that the raw data are bias free, the algorithm really would do no harm. For the other analogs the performance was mixed. In easier cases, where breaks were bigger and metadata better, it did better. In harder cases it fared worse. Where there was no overall sign bias in the applied breaks, the ensemble was spread relatively evenly around the starting data. But where there was an overall bias in the raw data presented to the algorithm, it consistently moved the overall data in the right direction but rarely far enough. The implication is that this uncertainty is effectively one-tailed: the chance of over-shooting the adjustments is substantially smaller than the chance of under-shooting the required adjustments.
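One simple way to express the "right direction but rarely far enough" result is the fraction of the raw-data trend bias that each ensemble member removes. The sketch below is an assumed scoring scheme, not the metric used in the paper, and the trend values in the usage example are made up.

```python
def bias_removed(true_trend, raw_trend, adjusted_trend):
    """Fraction of the raw trend bias removed by the adjustments.

    0 means no change, 1 means the truth was recovered, values above 1
    are over-shoots and negative values moved away from the truth.
    """
    raw_bias = raw_trend - true_trend
    if raw_bias == 0:
        raise ValueError("raw data are unbiased; the fraction is undefined")
    return (raw_trend - adjusted_trend) / raw_bias


def summarize(true_trend, raw_trend, ensemble_trends):
    """Count ensemble members by how they moved relative to the truth."""
    fractions = [bias_removed(true_trend, raw_trend, t) for t in ensemble_trends]
    return {
        "wrong_direction": sum(f < 0 for f in fractions),
        "undershoot": sum(0 <= f < 1 for f in fractions),
        "reached_or_overshot": sum(f >= 1 for f in fractions),
    }


# Made-up trends (e.g. in degrees C per decade) for one analog world.
counts = summarize(true_trend=0.20, raw_trend=0.10,
                   ensemble_trends=[0.12, 0.15, 0.18, 0.21, 0.09])
```

A one-tailed uncertainty would show up here as most members falling in the `undershoot` bin and few in `reached_or_overshot`.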

So, what does the real world look like when reassessed through this new understanding?

For minimum temperatures over the longest timescales the ensemble trends are spread around the raw data. But for the periods 1951 onwards and 1979 onwards the ensemble sits distinctly to one side of the raw data, and on opposite sides in the two periods: post-1951 trends in the raw data exhibit too little warming, and post-1979 trends too much. This is consistent with prior understanding of the biases on Tmin from the change of time of observation (largely 1950s-1970s; spurious cooling) and the move to MMTS sensors (early to mid-1980s and often associated with a microclimate relocation; spurious warming).

For maximum temperatures the ensemble consistently precludes the raw observations over all considered timescales. The raw data are almost certainly biased and show too little warming. Over all periods the ensemble of solutions shows more warming than the raw data. Again, this is consistent with current understanding of the impact of time of observation biases (again a spurious cooling) and the transition to MMTS which, unlike for minimum temperatures, imparts a spurious cooling effect. The less than encouraging implication from the analogs, however, is that in the operational USHCN algorithm we are substantially more likely to be under-estimating the required adjustments, and hence the rate of warming in maximum temperatures, than over-estimating it.
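Reading "precludes" as "the ensemble range does not include the raw value", the comparison can be sketched as a simple range check. The function name and the trend values are hypothetical, for illustration only.

```python
def ensemble_vs_raw(ensemble_trends, raw_trend):
    """Say where the raw trend lies relative to the full ensemble range."""
    lowest, highest = min(ensemble_trends), max(ensemble_trends)
    if raw_trend < lowest:
        return "precluded: every member shows more warming than raw"
    if raw_trend > highest:
        return "precluded: every member shows less warming than raw"
    return "raw trend lies within the ensemble range"


# Made-up trends for illustration.
tmax_like = ensemble_vs_raw([0.14, 0.16, 0.19], raw_trend=0.10)
spread_like = ensemble_vs_raw([0.08, 0.10, 0.13], raw_trend=0.10)
```

The first call mirrors the maximum temperature case described above; the second mirrors a case where the ensemble is spread around the raw data.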

Is this the last word on the issue? Certainly not. As this was a first step along this path, the analogs were perhaps not as sophisticated as would ideally be the case. Here we hope the benchmarking and assessment group can provide more realistic (and global) analogs later this year. Further, this considered solely parametric uncertainty - varying choices within one algorithm. The larger and more difficult uncertainty to understand is the structural uncertainty that would result from applying fundamentally distinct approaches to the same problem, which would allow a better exploration of the possible solution space. And here we have to look to you, readers, to develop novel approaches to homogenizing the data and apply them to the same raw data and analogs to allow consistent benchmarking and better understanding.

Finally, over the coming weeks we will be hosting code, data, and metadata (including the analogs) online. I'll provide an update when it is all up there, but given that it's several Gb and requires fitting around other duties it's not going to be instantaneous by any stretch.

Comments are welcome, but please remember that this is a strictly moderated blog and to follow the house rules.


  1. What does "precludes" mean in
    "the ensemble consistently precludes the raw observations ..."?

  2. Hank, precludes means that the ensemble range does not include the raw data - so the entire ensemble is moved away from the raw data in one direction.