Wednesday, January 2, 2013

Databank update - nearby 'duplicates' issue raised by Nick Stokes

Climate blogger Nick Stokes provided some additional feedback on the beta 2 release, alerting us to a case where two records for the same station were still present. This was not a bug per se. The station data in one of the data decks presented to the merge program had been adjusted, and hence the data disagreed. So the merge program was doing what it should: based upon a combination of metadata and data agreement, the probability that the two records were distinct was sufficiently large to constitute a new station.

One of the issues arising from the historically fragmented way data has been held and managed is that many versions of the same station may exist across multiple holdings. Often the holding will itself be a consolidation of multiple other sources and, like a Russian doll - well, you get the picture - it's a mess. So, in many cases we have little idea what has been done to the data between the original measurement and our receipt. These decks are given low priority in the merge process, but ignoring them entirely would be akin to cutting off one's nose to spite one's face - they may well contain unique information.

To investigate this further and more globally we ran a variant of the code with only a one-line change (Jared will attest that my estimates of line changes are always underestimates, but in this case it really was one line). If the metadata and data disagreed strongly then we withheld the station. We then ran a diff on the output with and without this check. The check flagged only stations that were likely bona fide duplicates (641 in all). This additional check will be enacted in the version 1 release (and hence there will be 641 fewer stations).
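For readers who want a feel for what such a check looks like, here is a minimal sketch in Python. It is not the merge program's actual code: the `Candidate` class, the two 0-to-1 scores, the simple averaging rule and both thresholds are all hypothetical, chosen only to illustrate the idea of withholding a station when metadata and data strongly disagree and then diffing the two runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    station_id: str
    metadata_score: float  # hypothetical: 0..1 agreement with an existing station, from metadata
    data_score: float      # hypothetical: 0..1 agreement with that station, from overlapping data


# Hypothetical thresholds, for illustration only.
SAME_THRESHOLD = 0.5  # combined score at or above this: merge into the existing station
CONFLICT_GAP = 0.6    # metadata and data scores this far apart count as strong disagreement


def decide(c, withhold_conflicts):
    """Return 'merge', 'new' or 'withhold' for one candidate record."""
    # The extra check: strong disagreement between metadata and data.
    if withhold_conflicts and abs(c.metadata_score - c.data_score) >= CONFLICT_GAP:
        return "withhold"
    combined = (c.metadata_score + c.data_score) / 2
    return "merge" if combined >= SAME_THRESHOLD else "new"


def new_stations(candidates, withhold_conflicts):
    """IDs of the candidates that would be added as new stations."""
    return {c.station_id for c in candidates if decide(c, withhold_conflicts) == "new"}


candidates = [
    Candidate("A", 0.95, 0.90),  # metadata and data agree it is a known station
    Candidate("B", 0.10, 0.05),  # metadata and data agree it is something new
    Candidate("C", 0.90, 0.05),  # metadata says known station, data disagree (e.g. adjusted data)
]

without_check = new_stations(candidates, withhold_conflicts=False)
with_check = new_stations(candidates, withhold_conflicts=True)

# The "diff": stations present without the check but withheld with it.
removed = without_check - with_check
print(sorted(removed))
```

In this toy case only station "C" drops out, which is the pattern described above: a record whose metadata point to an existing station but whose data, having been adjusted somewhere upstream, no longer match.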

Are we there now? Not quite. We still have to do the blacklisting. This is labor-intensive stuff. We will have a post on this - what we are doing and how we are planning to document the decisions in a transparent manner - early next week, time permitting.

We currently expect to release version 1 no sooner than February. But it will be better for the feedback we have received, and the extra effort is worth it for a more robust set of holdings.
