Tuesday, July 27, 2010

White paper 3 - Retrieval of historical data

The Retrieval of historical data white paper is now available for discussion. When making posts please remember to follow the house rules. Please also take time to read the full pdf before commenting and where possible refer to one or more of section titles, pages and line numbers to make it easy to cross-reference your comment with the document.

The recommendations are reproduced below:

1. A formal governance is required for the databank construction and management effort that will also extend to cover other white paper areas on the databank. This requires a mix of management and people with direct experience wrestling with the thorny issues of data recovery and reconciliation along with expertise in database management and configuration management.

2. We should look to create a version 1 of the databank from current holdings at NCDC augmented by other easily accessible digital data to enable some research in other aspects of the surface temperature challenge to start early. We should then seek other easier targets for augmentation to build momentum before tackling more tricky cases.

3. Significant efforts are required to source and digitise additional data. This may be most easily achieved through a workshop or series of workshops. More important is to bring the ongoing and planned regional activities under the same international umbrella, in order to guarantee that the planned databank can benefit from these activities. The issue is two-fold: first the releasing of withheld data, and secondly the digitising of data in hard copy that is otherwise freely available.

4. The databank should be a truly international and ongoing effort not owned by any single institution or entity. It should be mirrored in at least two geographically distinct locations for robustness.

5. The databank should consist of four fundamental levels of data: level 0 (digital image of hard copy); level 1 (keyed data in original format); level 2 (keyed data in common format) and level 3 (integrated databank/DataSpace) with traceability between steps. For some data not all levels will be applicable (digital instruments) or possible (digital records for which the hard copy has been lost/destroyed), in which case the databank needs to provide suitable ancillary provenance information to users.

6. Reconciling data from multiple sources is non-trivial requiring substantial expertise. Substantial resource needs to be made available to support this if the databank is to be effective.

7. There is more data to be digitised than there is dedicated resource to digitise. Crowd-sourcing of digitisation should be pursued as a means to maximise data recovery efficiency. This would very likely be most efficiently achieved through a technological rather than academic or institutional host. It should be double keyed and an acceptable sample check procedure undertaken.

8. A parallel effort as an integral part of establishing the databank is required to create an adjunct metadata databank that as comprehensively as feasible describes known changes in instrumentation, observing practices and siting at each site over time. This may include photographic evidence, digital images and archive materials but the essential elements should be in machine-readable form.

9. Development of new, formalized WMO arrangements may be needed, similar to those used in the marine community, to facilitate more efficient exchanges of historical and contemporary land station data and metadata (including possibilities for further standardization).

10. In all aspects these efforts must build upon existing programs and activities to maximise efficiency and capture of current knowledge base. This effort should be an enabling and coordination mechanism and not a replacement for valuable work already underway.
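
As an illustration of the traceability asked for in recommendation 5, here is a minimal sketch of how holdings at levels 0 to 3 might be linked. The `Holding` class, its field names and the identifiers such as `img-001` are hypothetical and are not taken from the white paper; the point is only that each holding records which holding it was derived from, and carries a provenance note when an earlier level does not exist.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Holding:
    """One item in the databank (hypothetical structure, for illustration only)."""
    holding_id: str
    level: int                          # 0 = image, 1 = keyed original, 2 = common format, 3 = integrated
    derived_from: Optional[str] = None  # holding_id of the step this one was produced from
    provenance_note: str = ""           # ancillary note when an earlier level is unavailable


def trace(holding_id: str, holdings: Dict[str, Holding]) -> List[Holding]:
    """Follow derived_from links back towards the earliest available level."""
    chain: List[Holding] = []
    seen = set()
    current_id: Optional[str] = holding_id
    while current_id is not None and current_id in holdings and current_id not in seen:
        seen.add(current_id)  # guards against accidental circular links
        holding = holdings[current_id]
        chain.append(holding)
        current_id = holding.derived_from
    return chain


def missing_levels(chain: List[Holding]) -> List[int]:
    """Levels below the starting holding that have no entry in its chain."""
    if not chain:
        return []
    present = {h.level for h in chain}
    return [level for level in range(chain[0].level) if level not in present]


# Example data: one fully traceable record and one from a digital instrument.
holdings = {
    h.holding_id: h
    for h in (
        Holding("img-001", 0),
        Holding("key-001", 1, derived_from="img-001"),
        Holding("std-001", 2, derived_from="key-001"),
        Holding("std-002", 2, provenance_note="digital instrument; no level 0 or 1 exists"),
    )
}
```

With this layout, `trace("std-001", holdings)` walks from level 2 back to the image, while `missing_levels(trace("std-002", holdings))` reports which levels are absent so that the provenance note can be shown to users instead.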


  1. Part of the process of acquiring data or documenting datasets involves human contacts - and may require quite a bit of effort just to locate the right person to deal with. Information may be provided by such person(s) in hardcopy or digital form, or by word of mouth. I think it is vital to ensure that such information becomes part of the permanent metadata. For example, the name, affiliation and contact information for the person that provided the dataset (and when) or the person who provided information about the dataset or data measurement procedures. Vital information gained face to face or via phone can be lost unless effort is made to preserve it in digital form. For example, Mr. Sam Jones, of the National Meteorological Center of Country X told me on July 29, 2010 in a phone conversation that his country has had a problem for several decades of observers reporting observations rounded to the "0" or "5" digit, etc. He also told me that the missing data from Apr-Sep of 1988 may be in boxes in some storage area, but he doesn't have access to the area and is not sure who knows what is back there.
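
A remark such as the one about rounding to the "0" or "5" digit is the kind of word-of-mouth metadata that can later be cross-checked against the data themselves. The following is a rough sketch of such a check, assuming that readings are stored as integers in tenths of a degree; the function names and the 0.5 threshold are illustrative assumptions, not an established procedure. Without any rounding preference, final digits 0 and 5 together would be expected in about 20% of readings.

```python
from collections import Counter
from typing import Dict, Iterable


def terminal_digit_shares(values_tenths: Iterable[int]) -> Dict[int, float]:
    """Share of each final digit (0-9) among readings given in integer tenths."""
    counts = Counter(abs(v) % 10 for v in values_tenths)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {digit: counts.get(digit, 0) / total for digit in range(10)}


def rounding_suspected(values_tenths: Iterable[int], threshold: float = 0.5) -> bool:
    """Flag a series when digits 0 and 5 together exceed the (assumed) threshold."""
    shares = terminal_digit_shares(values_tenths)
    if not shares:
        return False  # nothing to judge
    return shares[0] + shares[5] > threshold
```

A flagged series would not be corrected or removed on this basis; the result would simply be stored alongside the contact's statement as supporting metadata.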

  2. Good.

    1. Can we get a large digital search/scan provider, e.g. Google or Amazon, to contribute resources and action? They have a lot of experience of scanning vast quantities of paper.

    2. Discard nothing, ever. Certainly don't discard pre-existing level 2+ data just because it differs, or appears to differ, from level 0 or 1 data. Add metadata, tag and flag, but never, ever delete anything.

    3. When there is matching level 0 and level 1 or 2 data available for the same stations, if the similarity is determined by statistical sample then that could be used to guide re-keying of the data (so level 0 data which is very dissimilar to level 2 data could be re-keyed sooner).

    4. Talk to GalaxyZoo and Distributed Proofreaders about crowd-sourcing.
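
The suggestion of using a statistical sample to guide re-keying could look roughly like the sketch below. It assumes that, for each station, a small sample of values has been re-read from the level 0 images and paired with the corresponding level 2 values; the function names and the data layout are hypothetical. Stations are only ranked, never dropped, in keeping with "discard nothing".

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[float, float]  # (value re-read from level 0 image, existing level 2 value)


def mismatch_rate(sample_pairs: List[Pair]) -> Optional[float]:
    """Fraction of sampled values where the image and the keyed data disagree."""
    if not sample_pairs:
        return None  # no sample taken yet, so no rate can be stated
    differing = sum(1 for image_value, keyed_value in sample_pairs
                    if image_value != keyed_value)
    return differing / len(sample_pairs)


def rekey_order(samples_by_station: Dict[str, List[Pair]]) -> List[Tuple[str, float]]:
    """Rank sampled stations so the most dissimilar ones are re-keyed first."""
    scored = []
    for station, pairs in samples_by_station.items():
        rate = mismatch_rate(pairs)
        if rate is not None:
            scored.append((station, rate))
    # Highest mismatch first; station name breaks ties deterministically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Stations without a sample are left out of the ranking rather than assumed to be clean, so they can be sampled later.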

  3. On contents in lines 195-200: Be careful: paper records are not the only ones subject to deterioration, since digital media can be even more perishable than paper (fast changes in technology). The paper should state not only the need to preserve paper records, but also the need to migrate the imaged records into new digital formats.

    L291-293. This sentence is incomplete and incorrect. Countries that don’t make their historical climate data freely available, which unfortunately means most developing countries and a high number of developed countries, withhold them not only because of “political and/or financial imperatives” but also because of national laws protecting their documentary heritage. Besides, it’s incorrect to say that “such data in theory should be covered by WMO Resolution 40 (or 25 for hydrological data)”, since Annex 1 to Resolution 40 doesn’t take into account historical climate data (only actual data exchanged through GTS, not climate records). I think we need to make a strong case about the currently limited access to historical records and emphasise that historical climate data are not covered by this WMO Resolution. I’ll make suggestions on this when commenting on WP5, since I think this is the real obstacle to developing a global surface temperature dataset that is accessible and traceable. Here, I only wanted to emphasise that the sentence is incomplete and partially incorrect.

    L326-333 Although the authors of this White Paper (WP) are right to highlight the role of the ETCCDI in documenting changes in climate extremes over data-scarce regions, it is important not to forget that the daily data used to calculate the indices are mostly not disclosed by the countries participating in such regional workshops. More and more often, workshop attendees who want to publish their findings in peer-reviewed papers are asked by reviewers to make the data available as a condition for accepting the paper. Unfortunately, this is out of their hands.

    L340: For MEDARE you can pick up/give this URL: http://www.omm.urv.cat/MEDARE/index.html

    Recommendations 2 and 4 seem to me really reasonable and a feasible way to start and make the databank truly international. Recommendations 3, 6 and 7 rely on getting financial resources; without them, the development and maintenance of the dataset won’t be possible. Recommendation 9 is important for pushing ahead the need for freely accessible climate records produced by NMHS, but it will not be enough. As I’ll recommend under WP5, it is also necessary to be more ambitious and make a strong case through UN and PR.

  4. I can't find any estimates of the total amount of data we're talking about, either for the creation of a new surface temperature record, as narrowly interpreted, or for any of the broader scope this project might take on.
    Given the vagueness about the scope of the proposed data warehouse, there's probably no single number, but some ballpark figures for different scenarios would be useful.

  5. posted on behalf of:
    Hitomi Saito (Ms.)
    Climate Prediction Division
    Global Environment and Marine Department
    Japan Meteorological Agency

    (1) I understand from Figs. 1 and 2 of this WP that Level 5 is homogenized station observation data, but WP6 (lines 76-79) shows that a grid dataset is also included as Level 5. I would like to request that Level 5 be clearly defined as homogenized station observation data.

    (2) I think “uses existing dataset for new international dataset” and “data rescue effort from original data” are both important and they should be discussed separately.

  6. Nick,

    I know the guy who built the cameras for Google's book scanning. It's an open source design that runs a Linux OS on the camera. Very cool. Same guy did the Street View camera design.

  7. Both the subjects of Paper 3 and Paper 4 are rising branches. But I do not find the trunk, the plan of collection of basic data. Is it being prepared, perhaps as Paper 1 or 2? As a scientist who uses climate data, I would like to show my primitive thoughts about that part (though I am afraid there is nothing new to you).

    I think that there are two basic streams of data collection. Surface Temperature Datasets for the 21st Century (I abbreviate it as STD21 from now on) should set the priority between them.
    (1) Collecting data digitized and compiled at the national level (e.g. annual or monthly reports).
    (2) Collecting reports obtained through GTS.
    Both will be necessary to minimize gaps. Where both streams are available, investment in both may be redundant. Should we prioritize one of them, or take advantage of redundancy?

    (1) The first stream is collecting data digitized and compiled at the national level.

    Its principal part is expected to be voluntary submissions by NMHSs. If WMO decides to endorse STD21, it should recommend such actions. But perhaps financial support from WMO would be limited to technical assistance and organizing meetings. The eagerness of governments would vary from one to another.

    Sometimes data compilation may be funded by national or bilateral research and/or development projects which need climate data for their purposes. If the planners recognize STD21 as production of global public goods, they can include agreements to contribute their products to STD21.

    STD21 should also try to actively collect relevant reports in addition to receiving voluntary submissions.

    STD21 should make a recommendation on the contents and the format so that it can get "Level 2 data" (in the terminology of Paper 3). But perhaps STD21 will often need to receive national products as they are (which the STD21 center should handle as "Level 1 data"), and then painstakingly digest them into Level 2.

    It may happen that many more parties voluntarily submit data sets of unknown quality or provenance. Allocating human resources even to examine them will be a problem. The STD21 staff may decline some gifts, or they may not include them in the main database, when they think them likely to be inferior. Some authority may be necessary to prevent conflicts between the staff and the donor, or between the donor and the real owner.

    (2) The second stream is collecting and sorting reports obtained through GTS.

    The data obtained from GTS are stored at various NMHSs with many gaps. STD21 should receive several major collections of GTS data, merge them together, combine them with the first stream, identify gaps, and make requests to NMHSs to fill them. STD21 should write a guideline on how to answer these requests, and have WMO recommend it to NMHSs.

    I have monthly CLIMAT reports in mind so far. I wish STD21 would include daily or sub-daily time resolution if possible. If the matters discussed in Paper 4 can be realized, that would be great, but they should then be augmented with delayed-mode efforts.

    One imaginable way is to formalize transmission of "the summary of the day", not only in near-real-time but also in delayed mode.

    Another way is to enhance SYNOP reports to meet the needs of STD21. For surface temperature alone, it may suffice to have a mechanism for re-sending SYNOP reports in delayed mode and requesting it. If we include precipitation, we need to cure the big problem of the SYNOP code that the amount of precipitation is an optional element so that we cannot confidently distinguish zero precipitation from a missing record.
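
The merge-and-identify-gaps step described for the second stream can be sketched as follows. This assumes monthly values keyed by (station, year, month) and held in plain dictionaries, one per collection; the names `merge_collections`, `find_gaps` and `conflicts`, and the collection labels in the usage note, are hypothetical. Every source's value is kept side by side rather than overwritten, so disagreements between collections remain visible.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, int, int]  # (station id, year, month)


def merge_collections(collections: Dict[str, Dict[Key, float]]) -> Dict[Key, Dict[str, float]]:
    """Combine several collections, keeping each source's value for every report."""
    merged: Dict[Key, Dict[str, float]] = {}
    for source_name, reports in collections.items():
        for key, value in reports.items():
            merged.setdefault(key, {})[source_name] = value
    return merged


def find_gaps(merged: Dict[Key, Dict[str, float]], station: str,
              start: Tuple[int, int], end: Tuple[int, int]) -> List[Tuple[int, int]]:
    """List the (year, month) pairs from start to end with no report from any source."""
    gaps = []
    year, month = start
    while (year, month) <= end:
        if (station, year, month) not in merged:
            gaps.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps


def conflicts(merged: Dict[Key, Dict[str, float]]) -> Dict[Key, Dict[str, float]]:
    """Reports for which the sources hold different values."""
    return {key: values for key, values in merged.items()
            if len(set(values.values())) > 1}
```

For example, merging `{"centre_a": {...}, "centre_b": {...}}` and calling `find_gaps(merged, "X", (1988, 3), (1988, 10))` would return the months to request from the NMHS concerned, while `conflicts(merged)` lists the reports that need reconciliation.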

  8. There are numerous sources of historical data available, many of which tend to get dismissed as 'anecdotal'. Some material is obviously much better than others, but even fragments of data can be useful, especially if they can be corroborated through other, unrelated, evidence.

    It would be useful to determine the criteria that would enable us to decide what information is considered 'worthwhile' and what could be discarded (or placed into a secondary data bank for possible use at some future date).

    Tony Brown

  9. The strategy will depend on the broad scope of this project. Much collaboration and international cooperation will be required, but how much depends on the time frame we have in mind. I hope this is clarified (or defined) at the beginning of next week's meeting. I would like to know whether we are discussing the establishment of a semi-permanent databank (not owned by a single agency) and/or a trial activity of several years.
    Even if we are scoping a long-term (10 or 20 years) framework, the concepts described in chapter 8 should be discussed together with those described in chapters 3 and 4.