Monday, January 28, 2013

More on efforts at data rescue and digitization - reposted press release

Note: This is a copy of a press release by two sister organizations, ACRE and IEDRO that I am reposting for them upon request. Please send further requests to the named contacts. Peter

Contact: Malia Murray
Cell: 301.938.9894

The two independent organizations agree to work together for the digital preservation of, and access to, global historical weather and climate data through international data rescue and digitization efforts, as well as to undertake and facilitate the recovery of historical instrumental surface terrestrial and marine global weather observations.

November 26, 2012 – Toulouse, France – Over fifty international climate, social and humanities scientists, together with representatives of the archival and library communities with common interests in climate services, gathered at Météo-France for the 5th Atmospheric Circulation Reconstructions over the Earth (ACRE) Workshop, held 28-30 November 2012. There they witnessed the signing of a Memorandum of Understanding (MoU) that will join efforts within the global climate services industry. The 18th session of the Conference of the Parties to the United Nations Framework Convention on Climate Change (UNFCCC) and the 8th session of the Conference of the Parties serving as the Meeting of the Parties to the Kyoto Protocol were meeting simultaneously at the Qatar National Convention Centre in Doha, Qatar.

This agreement forms the foundation of the User Interface Platform (UIP), a pillar of the World Meteorological Organization's (WMO) Global Framework for Climate Services (GFCS). The partnership unites international experience and resources of the highest caliber to ensure the finest climate services. It specifically enhances data rescue (DARE), the establishment of an International Data Rescue (I-DARE) Inventory, and the identification of high-priority weather and climate datasets. The arrangement also opens opportunities for collaborative funding of the rescue of vital historical and contemporary weather and climate data, which is essential to the provision of, and access to, climate services.

Together, the ACRE and IEDRO communities, and their various partners, will develop the largest single source of primary weather and climate data for climate services. This will create opportunities to access long records of weather data available for the full range of analyses. Dr. Rob Allan, International ACRE Project Manager, said: "The merger of ACRE and IEDRO under a new MoU is a major step towards building the infrastructure and funding support needed to reinvigorate and sustain international data rescue activities. It will create a platform for wider partnerships with the global community and encourage funders to see the potential value in long, historical databases of global weather for use by the climate science and applications community, policy and decision makers, educators, students and the wider public."

The resulting climate services data will contribute high-quality, high-resolution weather and climate data, available through free and open exchange via the National Oceanic and Atmospheric Administration (NOAA), the International Comprehensive Ocean-Atmosphere Data Set (ICOADS), the International Surface Pressure Databank (ISPD), and the International Surface Temperature Initiative (ISTI) databases.

  The delegates also expressed the need for the establishment of an additional database where all hydrometric DR&D efforts would be listed and updated by their sponsors or program managers. IEDRO will begin building this new database once funding is secured.

For more information about this topic, or to schedule an interview with Dr. Richard Crouthamel, please contact him directly at 410.867.1124 or e-mail him at

Friday, January 25, 2013


Things have gone a bit quiet of late, I realize. In part this is due to real life, which has a habit of getting in the way. But in large part it's because we have been grappling with the creation of a blacklist. 'We' here is the very definition of the royal we, as it would be fairer to say that Jared has been grappling with this issue.

There be gremlins in the data decks that constitute some of the input data to the databank algorithm - both dubious data and dubious geolocation metadata. We knew this from the start, but we held off on blacklisting until we had the algorithm doing roughly what we thought it should and everyone was happy with it. We have now been attacking the problem for several weeks. Here are the four strands of attack:

1. Placing a running F-test through the merged series to find jumps in variance. This found a handful of intra-source cases of craziness. We will delete these stations through blacklisting.
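A minimal sketch of what such a variance-jump screen might look like (purely illustrative: the window length, significance level, and function name here are my own assumptions, not the databank's actual code):

```python
import numpy as np
from scipy import stats

def variance_jumps(series, window=60, alpha=0.001):
    """Slide through the series, F-testing the variance of the `window`
    points before each index against the `window` points after it.
    Returns the indices where equal variance is rejected."""
    flagged = []
    for i in range(window, len(series) - window):
        before = series[i - window:i]
        after = series[i:i + window]
        f = np.var(before, ddof=1) / np.var(after, ddof=1)
        # two-sided p-value for the F statistic
        p = 2 * min(stats.f.cdf(f, window - 1, window - 1),
                    stats.f.sf(f, window - 1, window - 1))
        if p < alpha:
            flagged.append(i)
    return flagged
```

Flagged indices would then feed the manual blacklisting decision rather than triggering automatic deletion.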

2. Running the series through NCDC's pairwise homogenization algorithm to see whether any really gigantic breaks in the series are apparent. This found no such breaks (but rest assured there are breaks - the databank is a raw data holding and not a data product per se).

3. Correlating first-difference series with proximal neighbors. We looked for cases where correlation was high but distance was large, where correlation was low but distance was small, and where correlation was perfect and distance small. These were then inspected manually. Many are longitude / latitude assignation errors. For example, we know Dunedin on the South Island of New Zealand is in the Eastern Hemisphere:
This is Dunedin. Beautiful place ...

And not the Western Hemisphere:
This is not the Dunedin you were looking for ... Dunedin is not the new Atlantis

But sadly two sources have the sign switched. The algorithm does not know where Dunedin is, so it is doing what it is supposed to. We therefore need to tell it to ignore / correct the metadata for these sources so we don't end up with a phantom station.

These checks picked up other issues than simple sign errors in lat / lon. One of the data decks has many of its French stations' longitudes inflated by a factor of 10, so a station at 1.45 degrees East is wrongly placed at 14.5 degrees East. Pacific island stations appear to have been recorded under multiple names and IDs, which confounds the merging in many cases.
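The neighbor-correlation screen of strand 3 can be sketched roughly as follows; the thresholds, field names, and helper functions are hypothetical choices for illustration, not the databank's code:

```python
import math
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def fd_correlation(x, y):
    """Correlation of first-difference series, which suppresses step
    inhomogeneities relative to the slowly varying background climate."""
    dx, dy = np.diff(x), np.diff(y)
    return float(np.corrcoef(dx, dy)[0, 1])

def flag_pair(stn_a, stn_b):
    """Return a reason string for a suspicious station pair, or None."""
    d = haversine_km(stn_a["lat"], stn_a["lon"], stn_b["lat"], stn_b["lon"])
    r = fd_correlation(stn_a["data"], stn_b["data"])
    if r > 0.9 and d > 1000:
        return "high correlation at large distance (possible sign/placement error)"
    if r > 0.999 and d < 50:
        return "near-perfect correlation nearby (possible duplicate record)"
    if r < 0.2 and d < 50:
        return "low correlation nearby (possible bad geolocation)"
    return None
```

With a pair like the mis-signed Dunedin records (near-identical data, hemispheres apart), `flag_pair` reports a high correlation at a large distance.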

4. As should be obvious from the above, we also needed to look at stations proverbially 'in the drink', so we have pulled a high-resolution land-sea mask and run all stations against it. All cases demonstrably wet - more than about 10 km from land, noting that 0.1 degrees is roughly 11 km at the equator and many sources report coordinates only to 0.1-degree precision - are being investigated.
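A toy version of the 'in the drink' check, assuming a boolean land mask on a regular lat/lon grid; the function, grid layout, and one-cell tolerance are illustrative assumptions rather than the actual implementation:

```python
import numpy as np

def wet_stations(stations, land_mask, res=0.1, tol_cells=1):
    """Flag stations whose grid cell, and every cell within `tol_cells`
    of it, is ocean.  `land_mask[i, j]` is True for land, with row i
    spanning latitude -90..90 and column j longitude -180..180 at `res`
    degrees.  One cell of slack absorbs the limited coordinate
    precision of many sources."""
    nlat, nlon = land_mask.shape
    flagged = []
    for name, lat, lon in stations:
        i = min(int((lat + 90) / res), nlat - 1)
        j = min(int((lon + 180) / res), nlon - 1)
        window = land_mask[max(i - tol_cells, 0):i + tol_cells + 1,
                           max(j - tol_cells, 0):j + tol_cells + 1]
        if not window.any():          # demonstrably wet: no land nearby
            flagged.append(name)
    return flagged
```

Anything flagged here still goes to a human: a station can legitimately sit on an islet too small for the mask to resolve.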

Investigations have generally used the trusty Google Maps and Wikipedia route, with other approaches where helpful. It's time-consuming and thankless. The good news is 'we' (Jared) are (is) nearly there.

The whole blacklist will consist of one small text file that the algorithm reads and one very large PDF that justifies each line in that text file. As people find other issues (and there undoubtedly will be some - even after several weeks on this we will only catch the worst / most obvious offenders), we can update and rerun.
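For illustration only, a reader for a small blacklist text file of this kind might look like the sketch below; the line format (station id, action, optional value) is entirely hypothetical, not the file's real layout:

```python
def read_blacklist(path):
    """Parse a simple whitespace-delimited blacklist: one entry per line,
    '#' starts a comment.  Hypothetical format: station_id action [value],
    where action is e.g. DROP, or FIXLON with a corrected longitude.
    Each line would be justified in the companion PDF."""
    actions = {}
    with open(path) as fh:
        for line in fh:
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
            parts = line.split()
            sid, action = parts[0], parts[1].upper()
            value = float(parts[2]) if len(parts) > 2 else None
            actions[sid] = (action, value)
    return actions
```

Keeping the machine-readable file tiny and pushing all justification into prose is what makes the rerun-on-new-issues workflow cheap.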

Tuesday, January 15, 2013

First public talk on Databank Merge Results: AMS Annual Meeting

While there have been previous talks and posters about the International Surface Temperature Initiative, as well as the overall structure of the Global Land Surface Databank, last week marked one of the first times our group has presented work on the merged product in public. I was given the opportunity to present at the 93rd Annual Meeting of the American Meteorological Society. The room was full of climatologists working at the national and international scale, and we even gained a few contacts in the hope of receiving more data for our databank effort.

In keeping with our aim to be open and transparent, the abstract from the conference can be found here, and the presentation can be found here. The presentation was also recorded, and once AMS puts the audio online, we will try to link to it.

Wednesday, January 9, 2013

How should one update global and regional estimates and maintain long-term homogeneity?

Prompted by recent discussions on various blogs and elsewhere (I'm writing this from a flaky airport connection on a laptop, so no links - sorry), it seems that, for maybe the umpteenth time, there are questions about how the various current global and some national estimates are updated. Having worked in two organizations that take two very distinct approaches, I thought it worth giving some perspective. It may also help inform how others who might come and use the databank to create new products choose to approach the issue.

The fundamental issue of how to curate and update a global, regional or national product whilst maintaining homogeneity is a vexed one. Non-climatic artifacts are not the sole preserve of the historical portion of the station records. Still today, stations move, instruments change, times of observation change, and so on, often for very good and understandable reasons (and often not ...). There is no obvious best way to deal with this issue. If it is ignored for long enough, station, local and even regional series can become highly unrealistic, as large, very recent biases go unaddressed.

The problem is also intrinsically interlinked with the question of which period of the record we should adjust for non-climatic effects. Here, at least, there is general agreement: adjustments should be made to match the most recent apparently homogeneous segment, so that today's readings can be easily and readily compared to our estimates of past variability and change without mental gymnastics.

At one extreme of the set of approaches is the CRUTEM method. Here, real-time data updates are only made to a recent period (I think still just post-2000) and no explicit assessment of homogeneity is made at the monthly update granularity (though QC is applied). Rather, adjustments and new pre-2000 data are effectively caught up with major releases or network updates (e.g. entirely new station record additions / replacements / assessments, normally associated with a version increment and a manuscript). This ensures that values prior to the last decade or so remain static for most month-to-month updates, but at a potential cost: a station inhomogeneity occurring in the recent past is de facto unaccounted for and can only be caught up with through a substantive update.

At the other extreme is the approach undertaken in GHCN / USHCN. Here, the entire network is reassessed every night, based upon new data receipts, using the automated homogenization algorithm. New modern periods of record can change the identification of recent breaks in stations that contribute to the network. Because the adjustments are time-invariant deltas applied to all points prior to an identified break, the effect is to change values in the deep past to better match modern data. So the addition of station data for Jan 2013 may change values estimated for Jan 1913 (or July 1913), because the algorithm now has enough data to find a break that occurred in 2009. This in turn may affect the nth significant figure of the national / global calculation for 1913 on a day-to-day basis. This is why, with GHCNv3, a version-control system of v3.x.y.z.ddmmyyyy was introduced and each version archived. If you want bit replication of your analysis to be possible, explicitly reference the version you used.
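The delta-before-the-break idea can be illustrated with a toy adjuster; the function and its window length are my own simplification for exposition, not the NCDC pairwise code:

```python
import numpy as np

def adjust_for_break(series, break_idx, window=24):
    """Apply a time-invariant delta to everything before `break_idx`
    so that the pre-break segment matches the level of the most recent
    (post-break) segment: recent data are left untouched and the deep
    past is shifted."""
    before = np.mean(series[max(break_idx - window, 0):break_idx])
    after = np.mean(series[break_idx:break_idx + window])
    adjusted = series.astype(float).copy()
    adjusted[:break_idx] += after - before
    return adjusted
```

Because the delta is re-estimated whenever new data arrive, extending the post-break segment can shift every pre-break value, which is exactly why a Jan 2013 receipt can move the Jan 1913 estimate.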

What is the optimal solution? Perhaps this is a 'How long is a piece of string?' class of question. There are obvious benefits to either approach, or to any number of others. In part it depends upon the intended uses of the product. If you are interested in serving homogeneous station series as well as aggregated area-averaged series using your best knowledge as of today, then something closer to NCDC's approach makes sense. If you are interested mainly in determining large-scale averages, and are willing to assume that on timescales of a few years the inevitable new systematic artifacts average out as Gaussian over broad enough spatial scales, then the CRUTEM approach makes more sense. And that, perhaps, is fundamentally why they chose these different routes ...


Saturday, January 5, 2013

High School Students Engage in Climate Research

Please note that this is a guest post by Rich Kurtz, a teacher from Commack, NY

A few years ago I had a student interested in climate change; my job as a science teacher was to work with her to help her develop a project.  In a circuitous way my student and I were introduced to Mr. John Buchanan, the Climate Change Student Outreach Chairperson for the Casualty Actuarial Society.  Mr. Buchanan helped us develop a project using data from weather logbooks from the 1700s recorded by a Philadelphia farmer, Phineas Pemberton.
Phineas Pemberton sample log page Jan. 1790, Philadelphia

My student was given the opportunity to present her data at the 3rd ACRE Workshop, Reanalysis and Applications conference in Baltimore, MD.  That meeting opened up the door to authentic learning opportunities for my students.  At the meeting I had the privilege of meeting scientists and educators from a broad spectrum of organizations.  Those professionals inspired me to investigate the possibility of introducing my students to the issues of climate change using historical weather data.   This has been a fruitful avenue of authentic learning experiences for my high school students.  With the help of outside mentors and ambitious and hardworking students we have been able to locate and use historical weather data for science research projects.  
Currently we are engaged in two projects.  One involves digitizing weather records from logbooks kept at Erasmus Hall School in Brooklyn, NY between 1826 and 1849.  Cary Mock of the University of South Carolina told me about the logbooks; they are housed at the New-York Historical Society.  One of my students photographed the entire set of logbooks and is using those photos to digitize the data and to explore and compare weather trends and changes.
Erasmus High sample log entry from January 1852 (Brooklyn, New York)
Another project involves a group of students who have volunteered to digitize weather and lake-height data from the Mohonk Preserve in the New Paltz area of New York State.  After reading about a presentation on climate change given by the director of the preserve, I contacted her and asked whether there was anything my students could volunteer to help with regarding weather data.  She was excited to get our students involved in digitizing the preserve's weather and lake water level records, which go back to the 1880s.  The students are currently putting the data from the logs into a database, from which they will develop research questions and formulate an investigation.
Sample log entry from Mohonk Lake Preserve area (upstate New York), January 1890
I think that there is a lot of interest among teachers in getting their students involved with authentic projects.  The advantage of working on historical weather projects is that this area of study merges many aspects of learning.  A historical weather project can bring together topics in history, science and math, and it helps students with their organizational skills.  My students sometimes have the opportunity to consult with a professional scientist.  These areas all touch upon skills that we want our students to acquire.
I would like to acknowledge some of the people who have helped me in my work with students: Mr. John Buchanan, the Climate Change Student Outreach Chairperson for the Casualty Actuarial Society; Mr. Eric Freeman of the National Climatic Data Center; Mr. Gilbert Compo of the NOAA Climate Diagnostics Center; and Cary Mock of the Department of Geography, University of South Carolina.

Wednesday, January 2, 2013

Databank update - nearby 'duplicates' issue raised by Nick Stokes

Climate blogger Nick Stokes provided additional feedback on the beta 2 release, alerting us to a case in which two records for the same station were still present. This was not a bug per se. The station data in one of the data decks presented to the merge program had been adjusted, and hence the data disagreed. So the merge program was doing what it should: based upon the combination of metadata and data agreement, the probability that the two records were distinct was sufficiently large to constitute a new station.

One of the issues arising from the historically fragmented way data have been held and managed is that many versions of the same station may exist across multiple holdings. Often a holding will itself be a consolidation of multiple other sources and, like a Russian doll - well, you get the picture - it's a mess. So, in many cases we have little idea what has been done to the data between the original measurement and our receipt. These decks are given low priority in the merge process, but ignoring them entirely would be akin to cutting off one's nose to spite one's face - they may well contain unique information.

To investigate this further, and more globally, we ran a variant of the code with only one line changed (Jared will attest that my estimates of line changes are always underestimates, but in this case it really was one line): if the metadata and data disagreed strongly, we withheld the station. We then ran a diff on the output with and without the change. The check found only stations that were likely bona fide duplicates (641 in all). This additional check will be enacted in the version 1 release (and hence there will be 641 fewer stations).
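A toy illustration of the one-line change and of diffing the two runs; the similarity scores, thresholds, and function names are invented for the sketch and bear no relation to the real merge program's internals:

```python
def merge_decision(meta_sim, data_sim, withhold_conflicts=False):
    """Toy merge decision.  meta_sim and data_sim are similarity scores
    in [0, 1] against the best-matching existing station.  The withhold
    rule catches records whose metadata say 'same station' but whose
    data disagree strongly, e.g. because one copy was adjusted."""
    if meta_sim > 0.9 and data_sim > 0.9:
        return "merge"
    if withhold_conflicts and meta_sim > 0.9 and data_sim < 0.5:
        return "withhold"          # likely a duplicate with altered data
    return "new_station"

def diff_outputs(candidates):
    """Stations whose fate changes when the withhold rule is enabled,
    mimicking the diff of the two runs."""
    return [sid for sid, m, d in candidates
            if merge_decision(m, d) != merge_decision(m, d, True)]
```

The appeal of the diff-the-outputs test is that it isolates exactly the stations the new rule touches, so each one can be eyeballed before the rule is adopted.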

Are we there now? Not quite. We have still to do the blacklisting. This is labor-intensive stuff. We will have a post on this - what we are doing and how we are planning to document the decisions in a transparent manner - early next week time permitting.

We currently expect to release version 1 no sooner than February. But it will be better for the feedback we have received and the extra effort is worth it for a more robust set of holdings.