One of the issues arising from the historically fragmented way data has been held and managed is that many versions of the same station may exist across multiple holdings. Often the holding will itself be a consolidation of multiple other sources and, like a Russian doll - well, you get the picture - its a mess. So, in many cases we have little idea what has been done to the data between the original measurement and our receipt. These decks are given low priority in the merge process but ignoring them entirely would be akin to cutting one's nose off to spite one's face - they may well contain unique information.
To investigate this further and more globally we ran a variant of the code with only one line change (Jared will attest that my estimate of line changes are always an underestimate but in this case it really was one line). If the metadata and data disagreed strongly then we withheld the station. We then ran a diff on the output with and without. The check found solely stations that were likely bona fide duplicates (641 in all). This additional check will be enacted in the version 1 release (and hence there will be 641 fewer stations).
Are we there now? Not quite. We have still to do the blacklisting. This is labor-intensive stuff. We will have a post on this - what we are doing and how we are planning to document the decisions in a transparent manner - early next week time permitting.
We currently expect to release version 1 no sooner than February. But it will be better for the feedback we have received and the extra effort is worth it for a more robust set of holdings.
Post a Comment