Monday, July 26, 2010

White paper 9 - Benchmarking homogenisation algorithm performance against test cases

The benchmarking homogenisation algorithm performance against test cases white paper is now available for discussion. When making posts please remember to follow the house rules. Please also take time to read the full PDF before commenting and, where possible, refer to one or more of the section titles, page numbers and line numbers to make it easy to cross-reference your comment with the document.

The recommendations are reproduced below:
• Global pseudo-data with real world characteristics

• GCM or reanalysis data should be used as the source base, with real spatial, temporal and climatological characteristics applied to recreate the statistics of the observational record to a reasonable approximation

• Review of inhomogeneity across the globe finalised via a session at an international conference (link with White Paper 8) to ensure plausibility of discontinuity test cases

• Suite of ~10 discontinuity test cases that are physically based on real world inhomogeneities and orthogonally designed to maximise the number of objective science questions that can be answered

• Benchmarking to rank homogenisation algorithm skill in terms of performance using climatology, variance and trends calculated from homogeneous pseudo-data and inhomogeneous data (discontinuity test cases applied) (skill assessment to be synchronised with broader efforts discussed in White Paper 10)

• Independent (of any single group of dataset creators) pseudo-data creation, test case creation and benchmarking

• Peer-reviewed publication of benchmarking methodology and pseudo-data with discontinuity test cases but ‘solutions’ (original homogeneous pseudo-data) to be withheld
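As a rough illustration of the benchmarking recommendation above (a hypothetical Python sketch, not part of the white paper), algorithm skill could be scored by comparing a homogenised series against the withheld homogeneous 'solution' in terms of climatology, variance and trend:

```python
import numpy as np

def benchmark_scores(truth, homogenised):
    """Compare a homogenised series against the withheld homogeneous
    pseudo-data ('solution') using climatology, variance and trend."""
    t = np.arange(len(truth))
    # Climatology: absolute error in the long-term mean
    clim_err = abs(homogenised.mean() - truth.mean())
    # Variance: absolute error in the variance
    var_err = abs(homogenised.var() - truth.var())
    # Trend: absolute error in the least-squares slope (per time step)
    trend_err = abs(np.polyfit(t, homogenised, 1)[0]
                    - np.polyfit(t, truth, 1)[0])
    return {"climatology": clim_err, "variance": var_err, "trend": trend_err}

# Toy example: homogeneous pseudo-data plus one uncorrected step discontinuity
rng = np.random.default_rng(0)
truth = 0.002 * np.arange(1200) + rng.normal(0.0, 0.5, 1200)
broken = truth.copy()
broken[600:] += 1.0  # a 1-degree step change at month 600
print(benchmark_scores(truth, broken))
```

In a real benchmark, `broken` would instead be the output of a homogenisation algorithm applied to the discontinuity test cases, and lower scores would indicate higher skill.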


  1. Performing tests of homogenization methods using pseudo-data is a very worthwhile exercise. But don't get too hung up on picking winners and losers or ranking the methods. Most "reasonable" methods are not going to differ that much in overall skill, but could seem different depending on the exact nature of the pseudo-data. More likely, some methods may perform better under some circumstances and worse under others -- picking an overall "winner" could then be tough. Perhaps the most useful outcome is to find the "bad" methods that should not be considered further, as well as defining typical "success" and "failure" rates of the methods as a group.
    It would be very valuable to have pseudo-datasets created independently of any of the homogenization teams. Blind tests would be very desirable. In addition, you might want to consider different sets of test data -- each skewed to have a certain kind of characteristic. For example, a set that has a systematic bias of one sign, a set that has drifts in the data (rather than discrete change-points), etc. Each set would focus on one particular malady. And then you could have a few sets that have combinations.
    In creating the pseudo-datasets you should strive for a broad spectrum of practitioners. One rule of thumb with data problems such as inhomogeneities is that when you finally think you've seen every possible thing that can go wrong, all you need to do is look at another station and you'll find a new type of malady!
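The different "maladies" described in this comment can be sketched as small generator functions. This is a hypothetical illustration in Python (the function names and parameters are my own, not from the white paper):

```python
import numpy as np

def add_step(series, at, size):
    """Discrete change-point: shift all values from index `at` onward."""
    out = series.copy()
    out[at:] += size
    return out

def add_drift(series, start, rate):
    """Gradual drift (e.g. slow sensor degradation) rather than a break."""
    out = series.copy()
    out[start:] += rate * np.arange(len(out) - start)
    return out

def add_sign_bias(series, size):
    """Systematic bias of one sign across the whole record."""
    return series + size

# Toy monthly temperature values; each test set applies one malady
rng = np.random.default_rng(42)
clean = rng.normal(15.0, 1.0, 600)
stepped = add_step(clean, at=300, size=0.8)
drifted = add_drift(clean, start=200, rate=0.005)
biased = add_sign_bias(clean, 0.3)
```

Combination sets, as suggested, would simply compose several of these functions on the same series.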

  2. 1) "Homogeneous observational datasets" do not exist in the real world; this state can only be approached. I think that the use of this terminology can be misleading. (I recommend instead: quasi-homogeneous dataset, or: first-class quality dataset...)

    2) One problem has not been addressed: if we create 10 test datasets with a "wide range of discontinuity types", some parts of them will likely represent the inhomogeneity properties of real observed datasets rather well. But we still do not know how large a part of the test datasets represents observed datasets, or certain kinds of observed datasets, well. I made a test dataset in which a large number of statistical characteristics of the detected inhomogeneities were set to be very similar to the characteristics of detected inhomogeneities in a central European real temperature dataset. Note that only detected characteristics can be made similar, because one never knows the true characteristics of the real inhomogeneities in real datasets. The method and my conclusions have been presented at several international conferences in recent years, and they are now included in two papers submitted for publication in peer-reviewed journals. A blog is not the most proper forum for sharing my experiences; however, I am open to further collaboration.

    Peter Domonkos,
    Tortosa (Spain), Univ. Rovira i Virgili

  3. Good. A little confusing in its use of the word "discontinuity" to include gradual artifacts such as an urbanisation effect, which may start from zero and grow continuously.
    One approach not described here, which might bear fruit, would be the publication (and versioning) of code to generate inhomogeneous datasets. Then benchmarking as described could generate a new inhomogeneous set on demand. This would prevent homogenisation algorithms from becoming tuned, deliberately or otherwise, to any particular inhomogeneous test dataset. (Instead the homogenisation algorithms will become tuned to the inhomogenisation algorithms, which is probably healthier).

    The Climate Code Foundation would be interested in being a "non-traditional participant", line 229.
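The publish-the-generator idea in this comment can be made concrete with a seeded, versioned generator. This is a hypothetical sketch (names and version tag are my own): given the published code, any seed yields a fresh inhomogeneous benchmark on demand, so algorithms cannot be tuned to one fixed test dataset.

```python
import numpy as np

GENERATOR_VERSION = "0.1"  # hypothetical version tag published with the code

def make_inhomogeneous(seed, n=1200, n_breaks=4):
    """Generate a fresh (truth, broken) pair of pseudo-data series.

    The same code version and seed always reproduce the same benchmark;
    a new seed produces a new blind test set on demand.
    """
    rng = np.random.default_rng(seed)
    # Homogeneous pseudo-data: a small trend plus noise
    truth = 0.001 * np.arange(n) + rng.normal(0.0, 0.5, n)
    broken = truth.copy()
    # Insert random step discontinuities at random positions
    for pos in sorted(rng.integers(1, n, size=n_breaks)):
        broken[pos:] += rng.normal(0.0, 0.6)
    return truth, broken  # `truth` is the withheld 'solution'

t1, b1 = make_inhomogeneous(seed=2010)
t2, b2 = make_inhomogeneous(seed=2010)  # identical to the first draw
```

Versioning the generator (rather than a dataset) also documents exactly which inhomogeneity model any published benchmark result refers to.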

  4. To follow on the other comments (with which I agree), I did not catch in this document an explicit statement that the simulated datasets will include a wide range of candidate climate time series. It may be tempting to simply use realistically forced climate model runs, for example, but we are interested in knowing whether the homogenisation methods can distinguish between the expected long-term changes and unexpected ones. We thus need to simulate a number of alternate (perhaps unrealistic) versions of the actual climate and weather, as well as alternate versions of the possible inhomogeneities.

    Also, I commented on Paper #8 so might as well repeat here that these two papers could be better integrated (though that is more an issue for the other paper noting what this one says, than the reverse).

  5. Posted on behalf of Peter Domonkos:

    4) I see with pleasure that the core group of the Surface Temperature project is determined to carry out well-founded tests before selecting quality-control and quality-improvement methods. I hope and wish that nothing will hinder it from following the planned course of testing and applying quality-improvement methods.

    5) I would like to comment on one blog post. “Statsman” wrote on 28 July:
    “…But don't get too hung up on picking winners and losers or ranking the methods. Most "reasonable" methods are not going to differ that much in overall skill, but could seem different depending on the exact nature of the pseudo-data…… Perhaps the most useful outcome is to find the "bad" methods that should not be considered further, as well as defining typical "success" and "failure" rates of the methods as a group….”

    Although Statsman has some seemingly nice arguments, I absolutely disagree with his main conclusion: what the “most useful outcome” is according to Statsman is, in my opinion, the poorest possible outcome, since it would amount to no more than our present knowledge about OMIDs (Objective Methods of Inhomogeneity Detection). I hope that the Surface Temperature project will achieve a substantial advance in the evaluation of efficiency for OMIDs. When OMIDs have different rank-orders according to the applied pseudo-data, experts have to evaluate with what probability or frequency the analysed types of pseudo-data occur in practice. Even when the mean gain seems small, the most useful outcome is a clear and detailed evaluation of the efficiency characteristics of individual OMIDs, since it costs little to replace a rather good method with an even better one, or to apply various well-specified methods according to geographical region if the results suggest a benefit.

  6. Benchmarking is important, and it should be done by relevant researchers who are involved in this kind of labor-intensive work. We involved climate modelers and MRI/JMA people who can handle super-high-resolution models, and we are also using this kind of "perfect model" approach to estimate the potential errors which arise from the imperfect station distribution. We are also doing cross-validation tests. I believe this kind of test should contribute to future observation system design. I am aware of the WMO's standards and recommendations for station network density (synoptic, climatology, rain-gauge networks), but these should be improved using the current observation records.