The beta3 release can be found here: ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage3/. Within that directory one can find all the data and code used, along with some graphics depicting the results of all the merge variants.
In addition, the previous betas remain available for anyone who wishes to run comparisons:
- beta 1: ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/archive/monthly/stage3/beta1/
- beta 2: ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/archive/monthly/stage3/beta2/
Changes in this beta release:
- A blacklist of candidate stations was generated, either to fix known errors in a station's metadata or data, or to withhold the station completely. This is a required input file for the code to run and is provided with this beta release.
- Some minor code changes were applied. Stations are now withheld when the metadata probability was near perfect but the data comparisons were so poor that the station became unique when it should have merged. In addition, odd characters are removed from the station name before the Jaccard Index is computed.
- The format of stage 3 data was changed to be consistent with all stage 2 data. In addition, all data provenance flags have been ported over in order to be open and transparent.
- Algorithm output is included with each variant result, in order to provide information about each candidate station and how the algorithm reached its decision to merge the station, keep it unique, or withhold it. A future post will go into great detail about each output file.
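To illustrate the station-name cleanup mentioned above, here is a minimal Python sketch of a Jaccard Index computed on cleaned names. It is not the code shipped with the release. The function names (`clean_name`, `bigrams`, `jaccard_index`), the definition of "odd characters" as anything other than letters, digits and spaces, and the choice of character bigrams as the compared sets are all assumptions made for illustration.

```python
import re


def clean_name(name):
    """Upper-case a station name and drop 'odd' characters.

    Assumption: 'odd' means anything that is not a letter, digit or space.
    """
    kept = re.sub(r"[^A-Za-z0-9 ]", "", name)
    # Collapse repeated whitespace so spacing differences do not matter.
    return " ".join(kept.upper().split())


def bigrams(text):
    """Return the set of adjacent character pairs in text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard_index(name_a, name_b):
    """Jaccard Index |A & B| / |A | B| on bigram sets of the cleaned names."""
    a = bigrams(clean_name(name_a))
    b = bigrams(clean_name(name_b))
    if not a and not b:
        # Assumption: two names with no usable bigrams carry no evidence of a match.
        return 0.0
    return len(a & b) / len(a | b)
```

With this cleanup, `jaccard_index("ST. JOHN'S", "ST JOHNS")` scores the two spellings as identical, whereas comparing the raw strings would penalize the punctuation.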
If you wish to provide comments, please send an e-mail to email@example.com.