Monday, July 26, 2010

White paper 5 - Data policy

The data policy white paper is now available for discussion. When making posts please remember to follow the house rules. Please also take time to read the full PDF before commenting and, where possible, refer to one or more of the section titles, page numbers and line numbers to make it easy to cross-reference your comment with the document.

The recommendations are reproduced below:
1. Enhance data availability
a. Build a central databank in which both the original temperature observations as well as multiple versions of the value-added datasets, i.e., quality controlled, homogenized and gridded products, are stored and documented together (including version control). The opportunity to repeat any enhanced analysis should exist. Not only will the methods used for adding value change over time and between scientists, but the data policy will change as well.
b. Provide support for digitization of paper archives wherever they may exist with the proviso that any data (and metadata) digitized under this program be made available to the central databank.
c. Enhance the international exchange of climate data by linking this activity to joint projects of global and regional climate system monitoring and by promoting the free and open access of existing databanks in accordance with set principles, e.g., those of the GEO.

2. Enhance derived product availability
a. Accept that there is a trade off between transparency and data quantity used for derived products. Transparency and openness, which scientists (including the authors) advocate, are hampered by the data policies of national governments and their respective NMHSs. Data policy issues are persistent and unlikely to change in the near future.
b. Hold a series of workshops to homogenize data and produce a gridded dataset. The original and adjusted data might not be able to be released but the gridded dataset and information on the stations that contributed to each grid box value would be released. These gridded datasets could be used by NMHSs to monitor their climate and fit together seamlessly into a global gridded dataset.
c. Ensure that the datasets are correctly credited to their creators and that related rights issues on the original data and the value-added products are observed and made clear to potential users. The conditions will be different for bona fide research and commercial use of data.

3. Involve NMHSs from all countries
a. Acknowledge that involvement of data providers (mainly NMHSs) from countries throughout the world is essential for success, and involves more than simply sending the data to an international data centre. For all nations contributing station records to benefit from this exercise, the scientific community needs to also deliver derived climate change information which can be used to support local climate services by the NMHSs. This return of investment is of particular importance for developing countries.
b. Adopt an end-to-end approach in which data providers are engaged in the construction and use of value-added products, not only because it is at the local level where the necessary knowledge resides on the procedures and circumstances under which the observations have been made, but also because this will make it easier to overcome access restrictions to the original data.
c. Increase the pressure on those countries not inclined to follow a more open data policy by engaging with institutions widely beyond the community of research scientists, including funding bodies, the general public, policy makers and international organisations.
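
Recommendation 2(b) proposes releasing gridded values together with information on the stations that contributed to each grid box, even where the station data themselves cannot be released. A minimal sketch of what such a product could look like follows. Everything in it is an assumption made for illustration and is not taken from the white paper: the 5-degree box size, the plain unweighted mean, the function names, and the station identifiers and values.

```python
from collections import defaultdict
from math import floor

BOX_SIZE_DEG = 5.0  # assumed grid resolution, not specified in the white paper


def grid_box(lat, lon, size=BOX_SIZE_DEG):
    """Return the (row, col) index of the grid box containing a point.

    Rows count northwards from 90S, columns eastwards from 180W.
    """
    n_rows = int(180.0 / size)
    n_cols = int(360.0 / size)
    row = min(int(floor((lat + 90.0) / size)), n_rows - 1)  # keep 90N in the top row
    col = int(floor((lon + 180.0) / size)) % n_cols         # wrap 180E onto 180W
    return row, col


def grid_stations(stations, size=BOX_SIZE_DEG):
    """Average station values per grid box and record which stations contributed.

    `stations` is an iterable of (station_id, lat, lon, value) tuples.
    Only the box mean and the contributing station IDs are returned,
    not the individual station values.
    """
    members = defaultdict(list)
    for station_id, lat, lon, value in stations:
        members[grid_box(lat, lon, size)].append((station_id, value))

    gridded = {}
    for box, entries in members.items():
        values = [v for _, v in entries]
        gridded[box] = {
            "mean": sum(values) / len(values),  # simple unweighted mean (assumption)
            "stations": sorted(s for s, _ in entries),
        }
    return gridded


# Hypothetical stations and values, invented for this example.
stations = [
    ("ST001", 41.1, 1.2, 0.4),
    ("ST002", 43.9, 3.8, 0.6),
    ("ST003", 35.7, 139.7, -0.2),
]
result = grid_stations(stations)
for box, info in sorted(result.items()):
    print(box, round(info["mean"], 3), info["stations"])
```

The output keeps the link from each grid box back to its contributing stations without exposing the underlying observations. An operational product would more likely work with homogenized anomalies and some form of weighting.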


  1. I don't think the need for station history metadata can be stressed enough. When trying to quality control and homogenize raw data there just is no substitute for such metadata. I think very serious effort needs to be made at two different levels to obtain all possible relevant metadata.
    First, at high levels, this point needs to be made -- and made over and over again. There is an inherent "cultural problem" in that data used for climate purposes is usually based on observations taken by organizations geared towards weather forecasting, where the value of metadata is often not appreciated. So a continual "public relations" campaign is needed at higher levels.
    Second, no matter how successful the first effort, inevitably it is going to be difficult to find and obtain metadata. At lower levels it will often require a sustained personal effort over a period of time. First, the right person(s) have to be located. Second, they have to be cajoled into helping out. But establishing personal contact for this purpose could reap great benefits. Unfortunately, such efforts have to be repeated for each country, and within a country perhaps several times at different institutions.
    The only way this is going to have any success is to dedicate resources to it. Monies have to be provided to create staff positions to deal with the acquisition and maintenance of metadata. There are 2 separate issues: digging up old metadata and establishing procedures so that such metadata are routinely maintained and made available in the future. What is needed is the creation of a small, but carefully chosen staff to form a metadata team with people that have both the technical expertise as well as foreign language AND people skills -- as well as a lot of patience and persistence. There should also be a budget to allow for occasional "field trips" abroad to try to obtain metadata. I don't think much progress can be made using part-time people borrowed from other projects for a limited time, as has usually been done in the past.

  2. I take strong issue with the tone of the paragraph beginning on line 96.

    Firstly, I would suspect that it is easier to get a CLIMAT message wrong than it is to get a SYNOP message wrong, especially when it comes to something as basic as temperature. There's more to go wrong in producing a CLIMAT. And in any case quantity may overcome quality - in the last 24 hours ECMWF has received some 140,000 SYNOP/METAR/SHIP observations.

    Secondly, NWP people do know how to process these observations, including the application of quality control. We have shown that the observations we hold for the past 40 years can be processed by reanalysis systems to give results that in large-area averages are not very different from Had/CRU's analyses of the supposedly superior "climate" data when we sub-sample the reanalysis data to have the same coverage as the CRUTEM data. And we provide coverage where it is lacking in the CRUTEM datasets, coverage that is influenced by sub-daily data from stations that do not provide data used in CRUTEM analysis.

    The sub-daily data also have other climate applications - identification of extremes, use in analyses to initialize short-term climate prediction, validation of climate models, ... .

    So I would urge the paragraph be written in a more positive tone to state that these data, like those exchanged in the form of monthly CLIMAT records, are an essential component of our overall climate data record.

  3. Excellent. Recommendation 1(b) is very good. If recommendation 3(c) implies a name-them-and-shame-them approach, then I think this is overdue and I applaud it. Recommendation 2(c) implies a dichotomy (between bona fide research on one side and commercial use on the other) which is an over-simplification. There is a multi-dimensional space of uses of data.

  4. I agree with the authors’ statement that the main reason for the lack of a high resolution climate databank is the data policy of many NMHSs. And I’d add that one of the key priorities in developing such a climate databank is making a strong case for freely available and accessible climate data. However, there is another important reason for the currently limited data availability: many countries have national laws protecting their historical documentary heritage, which includes ancient meteorological data. We are facing two important problems, which will have to be resolved if we want to develop and update such a databank.

    Unfortunately, Annex 1 to WMO Resolution 40 doesn’t cover the free exchange of climate records (current and historical data). So it seems clear that one front should be fostering a specific agreement on historical climate records through WMO channels. I know the WMO Secretariat wants to organise a big conference addressed to all NMHSs’ PRs to discuss this issue: elevating climate records to the category of mandatory data to be exchanged among its Members. I consider it necessary to go down this route and help the WMO Secretariat raise the issue of the free exchange of data. But it can’t be our only contribution. If we want a true policy of free climate data exchange, we have to pursue other routes as well.

    As said before, the other main obstacle to free access to climate records, particularly ancient/historical records, lies in national laws protecting documentary heritage. This affects not only NMHSs’ historical documentation but also other data holders/sources, limiting the free availability of imaged documents on-line, for instance. Besides, after last year’s hacking of CRU and the orchestrated PR campaign against climate change science, the public have got the wrong impression that climate data are already available and freely accessible, which is far from the reality. So, to overcome these two problems, two campaigns more ambitious than the “sustainable activities” suggested by the WP's authors should be carried out, in parallel with the “sustainable” ones. We should be more aggressive in telling the public the truth: data availability is limited and access to the data is restricted. Many examples can be given.

    Our case for fostering the free exchange of relevant climate data should be very ambitious, involving the highest international bodies (UN) and going along with a PR campaign explaining the limited availability/accessibility of climate data and the need for governments to give free access to their holdings.
    Both aspects should be raised through UNFCCC/SBSTA until they reach the UN General Assembly. And to raise these issues at the highest levels, we need public support, which makes a PR campaign necessary. Why not start by using the workshop statement to the media to make a clear introductory declaration using both arguments, followed by the other, more “technical” agreed aspects?

    I am aware my proposal will only work in the long run, but we should pursue it if we want to succeed in developing such a traceable and regularly updated dataset. In the medium term, ETCCDI-like activities or the other activity described in L199-213 should be carried out where possible, while being aware that these activities don’t ensure accessibility to raw data and that their traceability will therefore be compromised. It is true that these activities have helped and are helping a lot to improve data availability, but they have not ensured and are not ensuring the required reproducibility and traceability of the scientific assessments made. Letting the public know the truth will surely help.

    Manola Brunet
    Centre for Climate Change,
    University Rovira i Virgili
    Tarragona (Spain)

  5. Posted on behalf of:
    Hitomi Saito (Ms.)
    Climate Prediction Division
    Global Environment and Marine Department
    Japan Meteorological Agency

    I agree with the view that it is necessary to build a system in which all nations contributing station records will benefit from this exercise (lines 256-263). However, I think we should establish what kinds of products NMHSs really need and what kind of system is essential for NMHSs to provide the data.

  6. It would be really nice to see a set of use cases for different users of this service. These should cover not just different uses of the derived data products, but also examples of how and why people need to drill down to lower levels of data to explore its original source. Paul Edwards terms this "infrastructure inversion" - i.e. the need to dig down into the data collection infrastructure from time to time to consider how various forms of "data friction" can impact the validity of conclusions based on derived data products.

    In climate science, attempts at infrastructure inversion are sometimes constructive (i.e. good use cases) and sometimes destructive (so we'll need to think about abuse/misuse cases). An example constructive case would be where a scientist starts to suspect problems in the processed data are hampering progress on a particular research question, and digs down to seek potential errors or anomalies in the raw data. An example destructive case is when people want to deliberately sow doubt about a scientific result by drawing attention to weaknesses in the data collection system, whether or not these have any impact on the validity of that result.

    We need to understand these different kinds of use case, and be ready to support appropriate exercises in infrastructure inversion, and also ready to counteract inappropriate ones. One of the most likely responses to the latter is to have already done the inversion in-house, with a clear, transparent outcome, which then undermines the attempt to use it to criticize the science. I suspect that making the entire data trail available will increase these more destructive attempts at infrastructure inversion, rather than decrease them, because it offers so many more targets for misdirection. So, it would be good to see a concerted effort to understand this process, and respond appropriately.

    On a related note, there's nothing in the paper about providing user support. With more and more amateurs having a go at data analysis, either to educate themselves about climate change, or to attempt to prove a particular point, there will be a growing demand for user support, which, if not met, greatly increases the risk that misunderstandings and invalid uses of the data or analyses based on it will be circulated without challenge.

    Given the adversarial nature of some of the public discourse on climate change, there is potential for denial of service attacks on this service. I think technical denial of service attacks (e.g. swamping the data server with requests) can probably be handled easily through a distributed cloud architecture. However, there's another form of denial of service that ties up the time of the people running the service (both curators of the data bank itself, and the scientists whose work it supports). The risk is that these people can get tied up in responding to an endless series of inquiries about arcane aspects of the data, with the net effect of slowing or stopping progress on productive research (for example, where that research has the potential to produce results that are detrimental to a particular political or commercial interest). I think we need to acknowledge that as the political battle over climate change policy heats up (sorry for the pun) the likelihood of these types of information warfare being attempted grows too.

  7. Skimming through these, and the other white papers, I think you may be missing a trick. Linked data technology (RDF and SPARQL) does away with the need for a single large data repository.

    Each dataset and its metadata could be provided to the public and easily cross-referenced with other datasets. New datasets could be quickly integrated.

    I would be happy to explain the technology to someone on the panel, should they like.

  8. Unfortunately, recommendation 2(a) is true. Some countries do not readily share their observational data. Some collaborators give us valuable observational data confidentially. We cannot officially name them, so that they will not be blamed.