Friday, October 26, 2012
There is a meeting happening this week in Geneva in preparation for an Extraordinary Meeting of the World Meteorological Organization Congress. The meeting details can be found here. A link to a local low-resolution (still 3 MB for those on slow connections) version of the poster can be found here. The poster outlines progress to date and highlights that there is much that remains to be done. Given that many of the great and the good of the meteorological world are in attendance, including delegations from most National Meteorological Services, it is hoped that this poster can raise awareness of the databank, help gain access to additional data, and promote additional data rescue efforts. That said, nothing is ever likely to change overnight ...
Thursday, October 18, 2012
How do you decide if a station is to be merged, added as unique or withheld?
The above flowchart is a simple visualization of how the merge program works. As you can see, there are a number of different paths a candidate station can take. I'm confident enough to say that each and every situation above happens at least once in the recommended merge!
Let's break down this flowchart, starting with the metadata check. The candidate station runs through all the target stations and calculates three metrics:
- distance probability
- height probability
- station name similarity using Jaccard Index
Using a threshold of 0.50, we then see what the next step is. If no metadata probability values exceed this threshold, we check the validity of the individual metadata metrics. If it turns out that two metrics are really good (> 0.90) and the third one is terrible, we determine that the metadata are bad or incomplete, and the station is withheld. Otherwise we are confident that the station is unique in its own right, and we add it to the target dataset.
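To make the metadata check concrete, here is a minimal sketch of the name-similarity metric and the withhold-versus-unique rule. The databank code itself is FORTRAN 95 (merge_main.f95 and merge_module.f95, discussed in a later post below); this Python version is purely illustrative, and both the use of character bigrams for the Jaccard sets and the stand-in rule for a "terrible" third metric are assumptions.

```python
# Illustrative sketch only: the actual merge code is FORTRAN 95
# (merge_main.f95 / merge_module.f95). The bigram choice and the rule for a
# "terrible" third metric are assumptions made for clarity.

def jaccard(name_a: str, name_b: str) -> float:
    """Jaccard index over character bigrams of two station names."""
    def bigrams(s):
        return {s[i:i + 2] for i in range(len(s) - 1)}
    a, b = bigrams(name_a.upper()), bigrams(name_b.upper())
    return len(a & b) / len(a | b) if (a | b) else 0.0

def metadata_check(dist_p, height_p, name_p, threshold=0.50, strong=0.90):
    """Decision when no combined metadata probability exceeds the threshold:
    two very good metrics alongside one terrible one suggest bad or
    incomplete metadata (withhold); otherwise the candidate is unique."""
    lo, mid, hi = sorted([dist_p, height_p, name_p])
    if mid > strong and hi > strong and lo < threshold:
        return "withhold"  # bad or incomplete metadata suspected
    return "unique"        # add the candidate as a new station

# e.g. jaccard("BOSTON LOGAN INTL AP", "LOGAN INTERNATIONAL") -> value in [0, 1]
```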
If any stations exceed the threshold of 0.50, then we move down the left side of the chart. The next step is data comparison. Using an overlap of no less than 5 years, we calculate the Index of Agreement, a "goodness-of-fit" measure similar to the coefficient of determination but less sensitive to outliers. As with the metadata probability, this is calculated between the candidate station and every target station whose metadata probability exceeds 0.50.
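For reference, and assuming the code uses the standard Willmott (1981) form of the Index of Agreement, for candidate values \(C_i\) and target values \(T_i\) over their \(n\) overlapping months:

$$ d = 1 - \frac{\sum_{i=1}^{n} (C_i - T_i)^2}{\sum_{i=1}^{n} \left( |C_i - \bar{T}| + |T_i - \bar{T}| \right)^2} $$

where \(\bar{T}\) is the mean of the target series over the overlap; \(d = 1\) indicates perfect agreement and values near 0 indicate none.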
We then check to see whether any comparisons were made. If not, the two stations either had no overlap period at all, or had an overlap shorter than the 5-year threshold. At this point one of two things can happen. We look at the target station with the highest metadata probability: if that probability is greater than 0.85, the station merges; if not, it is withheld.
If a data comparison was made via the Index of Agreement, a lookup table takes into account both the IA and the overlap period and produces a probability of station match, as well as of station uniqueness. These are then recombined with the metadata probability to form posterior probabilities of station "sameness" and station "uniqueness". If any one of the sameness probabilities passes the sameness threshold, the candidate station merges with that target station. If no sameness probability passes that threshold, but a uniqueness probability passes the uniqueness threshold, the candidate station is added as unique. Otherwise, the station is withheld.
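Putting the two branches together, a minimal sketch of the decision skeleton might look like the following. Again, the real implementation is the FORTRAN 95 merge program described in a later post below; the 0.85 fallback comes from the flowchart description, while the posterior threshold values here are placeholders standing in for the "User Defined Thresholds" section of merge_main.f95.

```python
# Illustrative decision skeleton only; the actual logic lives in merge_main.f95.
# The 0.85 metadata fallback is from the flowchart description; the posterior
# thresholds below are placeholders, not the program's settings.

def decide(best_meta_prob, posterior_same, posterior_unique, comparisons_made,
           no_overlap_merge_threshold=0.85,
           same_threshold=0.50, unique_threshold=0.50):
    """posterior_same / posterior_unique: posterior probabilities of 'sameness'
    and 'uniqueness' against each target whose metadata probability > 0.50."""
    if not comparisons_made:
        # No overlap of at least five years: fall back on metadata alone.
        return "merge" if best_meta_prob > no_overlap_merge_threshold else "withhold"
    if any(p > same_threshold for p in posterior_same):
        return "merge"      # merge with the best-matching target station
    if any(p > unique_threshold for p in posterior_unique):
        return "unique"     # add the candidate as a new station
    return "withhold"
```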
A more detailed description of the above flowchart can be found here.
Tuesday, October 16, 2012
How do I work out where a station series in the merged product originates?
The above image shows an example station, Logan International Airport in Boston, Massachusetts, USA. Three different sources went into this merged station. For every temperature value in the merged product there is a corresponding number, which gives the position in the source hierarchy of the source that supplied that value. Using that number, one can identify the station source.
Using the above image, it can be seen that the sources are GHCN-Daily (source #01, black), russsource (source #35, red), and ghcnmv2 (source #39, blue). Now that the sources are known, one can find the Stage 2 data for this station. A user can also look further back and find the original digitized copy (Stage 1), and sometimes even the original paper copy (Stage 0).
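As a hypothetical illustration of that lookup, the sketch below maps per-value source numbers back to source names using the priority order in databank_sources.txt (described in the next post down). The file layout assumed here, with the source name as the first whitespace-delimited field and priority following file order, is a guess; adjust the parsing to the real format.

```python
import io

def load_source_hierarchy(f):
    """Return {priority_number: source_name}, assuming the source name is the
    first whitespace-delimited field and that priority follows file order."""
    hierarchy, rank = {}, 0
    for line in f:
        if line.strip() and not line.startswith("#"):
            rank += 1
            hierarchy[rank] = line.split()[0]
    return hierarchy

# In practice: with open("databank_sources.txt") as f: load_source_hierarchy(f)
demo = io.StringIO("ghcn-daily <nstations> daily <elements>\n"
                   "russsource <nstations> monthly <elements>\n")
print(load_source_hierarchy(demo))  # -> {1: 'ghcn-daily', 2: 'russsource'}
```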
Wednesday, October 10, 2012
Where is the databank merge code? How can I make it work?
The merge code was written in FORTRAN 95 and is located within each variant's directory. For example, the code for the recommended merge is located here:
ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage3/recommended/code/
Once uncompressed, there are four files required for the program to run correctly. The program will fail if any one of these files is missing. A description of each file is below:
databank_sources.txt: This is the prioritized list of sources that go into the merge program. The file gives the name of each source, its number of stations, whether it was originally a monthly or daily source, and whether it includes TMAX, TMIN, or TAVG temperature. To acquire the source data itself, download it from the databank monthly Stage 2 FTP site.
lookup_IA.txt: This is the lookup table the program reads in to determine the probability of station match and station uniqueness.
merge_module.f95: This is a module the main program calls when performing certain functions. This was done so that simple procedures called multiple times are written only once. In addition, it provides the user the opportunity to write in their own code and compare results.
merge_main.f95: This is the main program. The first section, named "User Defined Thresholds," is where the user can define directory structures and performance thresholds.
A more detailed description of these files, along with justification, can be found in the merging methodology document. A compiler is required to run the program. There are many different FORTRAN compilers; the code was written to compile with the g95 compiler, which is free and available to the public here.
Once a compiler (such as g95) is installed, simply type in the following command:
g95 merge_module.f95 merge_main.f95
This compiles the module and the main program into a single executable (a.out by default), and you should be good to go! Although not required, the user is strongly encouraged to tweak the thresholds and/or priority list in order to explore different results.
If there are any questions, feel free to send an e-mail to data.submission@surfacetemperatures.org, or simply comment on this post.
Tuesday, October 9, 2012
Why are there several variants of the databank merge?
Historically, the storage, sharing and rescue of land meteorological data holdings have been incredibly fractured in nature. Different groups and entities at different times have undertaken collections in different ways, often using different identifiers, station names, station locations (and precision) and averaging techniques. Hence the same station may well exist in several of the constituent Stage 2 data decks but with subtly (or not so subtly) different geo-location or data characteristics.
Neither an analyst nor an automated procedure will make the right call 100% of the time. However, an automated procedure does allow us to spin off some plausible alternatives. By spinning off such alternatives we can allow subsequent teams of analysts looking at creating quality controlled and homogenized products to assess the sensitivity of their results to these choices. The alternative is to ignore such uncertainty, yet it clearly has some bearing on the final data products: errors at this stage (merging records that are in fact unique, or deeming the same record unique) will affect the final results.
Several members of the Working Group suggested different merge variants, varying the choice of source decks and their ordering as well as the parameter choices within the code itself. From the methodology summary:
Variant One (colin)
In this variant, the source deck is reordered to prioritize sources that originated from their respective National Meteorological Agencies (NMAs). This way, the most up-to-date locally compiled data are favored over consolidated repositories, which may or may not be up to date. In addition, sources that are either raw or quality controlled are favored over homogenized sources.
Variant Two (david)
Here, NMAs are favored, with sources having TMAX, TMIN, and comprehensive metadata given the highest priority. The overlap threshold is lowered from 60 months to 24 months so that more data comparisons can be made.
Variant Three (peter)
The source deck is changed under the following considerations. No TAVG source (or data from mixed sources) is ingested into the merge, because there is uncertainty in the calculation of TAVG (i.e., it is not always (TMAX + TMIN) / 2). TAVG in the final product is generated only from its respective TMAX and TMIN values. For the remaining sources, GHCN-D is the highest priority, and the rest are ranked by the longest station record present within the source deck, from longest to shortest. The metadata equation is changed to weight the distance probability (10) over the height (1) and Jaccard (1) probabilities (the defaults are 9, 1, and 5, respectively). Finally, the merge and uniqueness thresholds are lowered in favor of merging more stations.
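One plausible reading of that metadata equation, assuming it is a simple weighted mean of the three probabilities (the merging methodology document gives the exact form), is:

$$ P_{\text{meta}} = \frac{w_d P_d + w_h P_h + w_j P_j}{w_d + w_h + w_j} $$

with weights \((w_d, w_h, w_j) = (9, 1, 5)\) by default and \((10, 1, 1)\) in this variant, so that station distance dominates the combined probability.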
Variant Four (jay)
Within the algorithm, the data comparison test has three distinct outcomes: the station is merged, added as unique, or withheld. In this variant, the withheld outcome is removed, so the candidate station is always either merged or added as unique.
Variant Five (matt)
All homogenized sources are removed. Nothing else is altered compared to the recommended merge.
Variant Six (more-unique)
Thresholds are adjusted to make more candidate stations unique, thus increasing the overall station count.
Variant Seven (more-merged)
Thresholds are adjusted to make more candidate stations merge with target stations, thus decreasing the overall station count.
These have a substantive impact on several aspects of behavior. Most notably:
Station count
Gridbox coverage
Timeseries behavior
The outlier in each case is the 'peter' variant (yep, that is me), and results from the fact that early records have very few max/min measurements as currently archived. The lower anomaly early on in this variant is a result of sampling, not a reflection of fundamental discrepancies: if we sub-sample the other variants to the same very restricted geographic sample, they fall back into agreement.
This reflects the very real importance of data rescue. We know there is as much pre-1960 data in image / hardcopy form only as has been digitized. We will return to this at a later date.
Of course, if you don't like these variants or just want to have a play, you can create your very own variant simply by downloading and using the code that has been made available alongside the beta release.
Friday, October 5, 2012
Is it too late to submit data for inclusion in the databank?
Short answer: No.
Slightly longer answer: We are still accepting data submissions for inclusion in the first version release until November 30th. At that point we shall provide an updated beta release version with any new data sources that have been received. Even then, there is no end-point for submissions; data received later can be included in subsequent version releases. There is also no point at which we are likely to have 'too much' data, so any data is useful.
More detail:
Data submissions can range from a single station to large consolidated holdings. Because the merge program attempts to discriminate between different sources, it should be able to cope with a degree of information redundancy, so long as sufficiently accurate geo-location metadata (latitude, longitude, station name and elevation) are provided. It is therefore not necessary to first ascertain whether a version of each candidate station record already exists. Indeed, if the submission has greater provenance (a link to the original in hardcopy / image form, better station metadata including a history of observing practices and instrument changes, etc.) it will likely be given priority. So, do not worry about whether the data already exist in the Stage 2 holdings, unless the submission is simply a duplicate resubmission of a pre-existing holding (obviously).
If you need help in negotiating the release of data to the databank, there exist a boilerplate letter of support and a certificate of appreciation (the latter on request). Further case-specific help can be provided by Databank Working Group members upon request.
Once you have the data, we have tried to make its submission as easy as possible. The submission guidelines provide the details of what data are required and how to submit. We do not require that data be converted from their native digital format; in fact, we prefer that you do not convert them, as conversion may introduce undetectable errors. You do, however, need to describe the format in enough detail that a conversion script can be written to convert the data to Stage 2.
Although the first Stage 3 merged release consists solely of monthly-resolution temperature data, we strongly encourage submission of data at one or more of sub-daily, daily and monthly resolution, and for multiple meteorological elements, not just temperature. It is hoped that future releases will include these shorter timescales and additional meteorological elements. Such data will be useful for many scientists and end-users beyond the more restricted aims of the International Surface Temperature Initiative.
Thursday, October 4, 2012
What known issues remain to be addressed with the databank during beta release?
First and foremost, this beta release affords the opportunity for broader community input to the databank process. Having many eyes on the prize, hands turning over the rocks and boots kicking the tires will make this thing better. So, there will undoubtedly be a number of issues, in addition to those highlighted here, that will arise. We welcome such feedback. There are a number of issues which we know we will address during beta:
- Where we have daily sources, we will append provenance flags specifying how many daily reports went into each monthly value. This may prove useful to analysts down the line. Where we have only monthly data, we will append this flag as a missing-value indicator.
- We will create a version that re-injects the element-wise provenance flags from the constituent Stage 2 (source deck) holdings into the Stage 3 (merged) holdings. It is not computationally feasible to carry the flags through the merge program, but, obviously, the information is available to do this as a post-processing step.
- We are well aware that some stations will have poor geo-location metadata (e.g., Spanish stations placed in the Sahara, or older station segments using a meridian other than Greenwich). At present no blacklisting is applied, but we will definitely apply such blacklisting to the first version merge. We are still collating a list of known geo-location errors to apply such a fix. One of two things will be done: correct the geo-location data where the true location is known and generally accepted, or force the code to withhold the station from the merge. If you find an apparent issue with a station's location, please let us know either through the blog or at data.submission@surfacetemperatures.org so that we can investigate and determine what to do.
- We plan to release all the Stage 1 to Stage 2 conversion scripts for completeness. This will be done soon; there are only so many hours in the day, and getting the merge in order has been the priority.
- We plan to create several output formats to aid usability, hopefully including CF-compliant netCDF. For now the databank is made available in two ASCII-based formats.
- Instigation of a consistent station identifier system that is robust to future data additions and deletions and is consistent with daily identifiers moving forwards.
Beta release of first version of global land surface databank
Today marks the release of the first beta version of the global land surface databank constructed under the auspices of the International Surface Temperature Initiative's Databank Working Group. The release is of monthly average temperatures from stations around the globe that have been made available without restriction.
The release will be in beta for a period of 3 months before an official first version release. It is hoped that during this time users can take a look and provide feedback (preferably through the Initiative blog) and advice to ensure that the first version release is of the highest possible quality. Additional data submissions received prior to November 30th will be incorporated in the first version release.
The release consists of:
- Over 40 distinct source decks (compilations / holdings) submitted to the databank to date, in Stage 0 (hardcopy / image, where available), Stage 1 (native digital format), and Stage 2 (converted to common format and with provenance flags)
- A recommended merged product and several variants thereon, all built off the Stage 2 holdings
- All code used to process the data merge
- Documentation necessary to understand at a high level the processing of the data
The release is available from ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/. The merged product can be found at ftp://ftp.ncdc.noaa.gov/pub/data/globaldatabank/monthly/stage3/. The recommended merge consists of over 39,000 stations, which range in length from a few years to over two centuries.
Most of these data have not been quality controlled or bias corrected. It is important to stress that the databank therefore does not constitute a climate data record / dataset suitable for monitoring long-term changes. Rather, it provides a basis from which research groups can create algorithms to produce climate datasets. The results from these algorithms can then be compared and benchmarked as part of the International Surface Temperature Initiative activities. We hope that many groups and individuals take up this challenge, which will lead to improved understanding of land surface air temperature changes, particularly at regional scales.
This release is the culmination of two years' effort by an international group of scientists to produce a truly comprehensive, open and transparent set of fundamental monthly data holdings. In the coming weeks a number of additional postings to the blog will attempt to explain different aspects of this databank.
More information on the Initiative and how to get involved can be found at www.surfacetemperatures.org.