(Data ingest, organization, quality control and distribution)
Peter Gleckler, Charles O'Connor, Bob Drach, Karl Taylor and Dean Williams
This evolving document provides a glimpse into PCMDI's climate data
management strategy as applied to AMIP. It is not yet meant to serve as
an official reference. The authors intend to submit a more thorough
document to a peer-reviewed journal. Contact Peter Gleckler
(gleckler1@llnl.gov) for more information on the AMIP database, or
Dean Williams (williams13@llnl.gov) regarding the PCMDI Software System.
We briefly summarize how the PCMDI Software System is used to manage
AMIP. The tools and techniques established for AMIP serve as a test bed
for other projects managed by PCMDI (e.g., CMIP and SMIP).
AMIP has been managed by PCMDI since 1990, when it was launched by the World Climate Research Programme. Until recently, handling the project data was difficult and inefficient. Two critical advancements have enabled the PCMDI team to develop an effective strategy for streamlining the organization of terabyte-scale data provided by more than 30 climate modelling groups.
The first breakthrough was the realization by many in the community that data transmission standards can greatly increase the efficiency of collaborative research. Adherence to the data transmission standards developed for AMIP2 has reduced the complexity of data received by PCMDI. More recently, community progress towards defining climate-specific "metadata" standards has enabled PCMDI to provide scientists with a variety of valuable information for research, and to make project data available in a form that is recognized by a variety of software tools.
The second critical advancement has been the development of the Climate Data Analysis Tool (CDAT), which has revolutionized PCMDI's capability to handle climate model data. CDAT takes full advantage of the interpreted, object-oriented scripting language Python, and serves as the foundation of the PCMDI Software System. Data organization, complex calculations and analysis, and visualization are all supported via CDAT. One by one, the many problems associated with the management of AMIP have been solved with CDAT. CDAT is also the tool of choice for research at PCMDI.
PCMDI's strategy for managing AMIP data consists of: 1) reliance on data transmission standards; 2) the ingest and organization of project data; 3) quality control testing; and 4) data handling and distribution to researchers. Each of these system components is briefly summarized below, and how they relate to AMIP research is depicted in red with this schematic.
1. Data transmission standards for AMIP
2. The ingest and organization of project data
Making use of the PCMDI Software System, Charles O'Connor has revolutionized PCMDI's ability to manage climate projects. For AMIP, LATS-generated data are filtered into GDT-compliant netCDF data with a standard file structure (GDT is an extension of COARDS tailored for climate applications). Charles's system ensures that most problems encountered during the QC process can be easily corrected, and he has developed a versatile "file spanning" capability that enables users to link files in CDAT. What took months of painstaking effort for AMIP-I can now be accomplished in less than a day, despite the enormous increase in data volume and complexity. Organization of AMIP data has proven so effective that a similar strategy is being adopted for other projects supported by PCMDI. As a result of this "data ingest" capability, LATS will only be necessary for future simulations provided in GRIB. Project participants using netCDF will only need to ensure that their data are GDT or COARDS compliant and adhere to a few simple conventions for variable identification (to be described here soon).
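The idea behind "file spanning" can be illustrated with a minimal Python sketch. This is not the CDAT implementation; the `Segment` record, the function names and the use of integer time indices are all assumptions made for illustration. The sketch treats several files as one logical time axis and reports which files are needed to read a requested range.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """Coverage of the time axis by one file (hypothetical record layout)."""
    path: str
    start: int  # first time index held in the file, inclusive
    stop: int   # one past the last time index held in the file


def build_span(segments: List[Segment]) -> List[Segment]:
    """Order segments by start index and check that they tile the axis
    with no gaps and no overlaps."""
    ordered = sorted(segments, key=lambda s: s.start)
    for prev, cur in zip(ordered, ordered[1:]):
        if cur.start != prev.stop:
            raise ValueError(
                f"gap or overlap between {prev.path} and {cur.path}"
            )
    return ordered


def files_for_range(span: List[Segment], start: int, stop: int) -> List[str]:
    """Return the paths of the files that hold time indices [start, stop)."""
    if start >= stop:
        return []
    # A segment is needed when its interval intersects the requested one.
    return [s.path for s in span if s.start < stop and s.stop > start]
```

For example, with three files covering indices 0-119, 120-239 and 240-359, a request for indices 100 to 129 would touch only the first two files. A real system would work with calendar times read from the file metadata rather than bare integer indices.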
3. The AMIP Quality Control System
The AMIP QC system consists of a suite of analysis codes and visualization utilities that enable rapid identification of problems. Karl Taylor has developed the primary quality control codes, which are currently being generalized and incorporated into CDAT. Charles Doutriaux has created GUIs (for example) driven by CDAT to facilitate viewing of QC statistics. QC tests are performed once incoming data have been ingested and organized, and problem data are either corrected or flagged. Some example plots of frequently encountered problems: one and two.
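As a rough illustration of the kind of test a QC suite might run, the following Python sketch computes summary statistics for a field and counts values that fall outside a plausible range. It is not one of the AMIP QC codes; the function name, the returned keys and the bounds in the usage note are hypothetical.

```python
import math
from typing import Dict, Iterable, Optional


def qc_summary(values: Iterable[float], valid_min: float, valid_max: float,
               missing: Optional[float] = None) -> Dict[str, float]:
    """Summarize a field and count values outside [valid_min, valid_max].

    Values equal to `missing` (if given) and NaNs are counted as missing
    and excluded from the statistics.
    """
    n_missing = 0
    n_bad = 0
    good = []
    for v in values:
        if (missing is not None and v == missing) or math.isnan(v):
            n_missing += 1
        elif v < valid_min or v > valid_max:
            n_bad += 1
        else:
            good.append(v)
    return {
        "count": len(good) + n_bad + n_missing,
        "missing": n_missing,
        "out_of_range": n_bad,
        "min": min(good) if good else math.nan,
        "max": max(good) if good else math.nan,
        "mean": sum(good) / len(good) if good else math.nan,
    }
```

Calling `qc_summary(field, 150.0, 350.0, missing=1.0e20)` on a surface temperature field in kelvin, with those illustrative bounds, would flag values that suggest a units error or a corrupted record while leaving the statistics of the remaining values available for inspection.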
4. Data handling and distribution
The central location of the AMIP database is a 1 TB allocation on PCMDI's RAID system. Data overflows are archived at NERSC. Data provided to approved AMIP diagnostic subprojects are compressed with gzip before being made available on FTP servers. Electronic distribution is the preferred method, but in a few cases this is not practical and data must be distributed by tape. Clearly, data distribution is an area where further work is needed.
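The compression step before distribution could be scripted along the following lines. This is a minimal sketch using Python's standard `gzip` and `shutil` modules, not PCMDI's actual distribution script; the function name and the directory layout are assumptions.

```python
import gzip
import shutil
from pathlib import Path


def gzip_file(src: Path, dest_dir: Path) -> Path:
    """Write a gzip-compressed copy of `src` into `dest_dir`.

    The original file is left untouched; the copy is named `<src name>.gz`.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / (src.name + ".gz")
    # Stream the file in chunks so large data files are never held
    # in memory all at once.
    with open(src, "rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Recipients can restore the original file with the standard `gunzip` utility or with `gzip.open` in Python.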
5. Summary and future directions
After years of infrastructure development, efficient management of AMIP
data has been realized, providing a test bed for PCMDI's management of
other projects. Many PCMDI scientists have contributed to this effort,
which has been coordinated by Peter Gleckler. The tools developed by the
PCMDI computer science team (Paul Dubois, Bob Drach, Charlie O'Connor
and Dean Williams) form the essence of PCMDI's data management.
Community data conventions help ensure that PCMDI provides high-quality
data products for community research. Additional efforts to further
streamline the system (e.g., for data distribution) are under way and
will be described here as they develop.