Managing AMIP with the PCMDI Software System

(Data ingest, organization, quality control and distribution)

Peter Gleckler, Charles O'Connor, Bob Drach, Karl Taylor and Dean Williams

This evolving document provides a glimpse into PCMDI's climate data management strategy as applied to AMIP. It is not yet meant to serve as an official reference. The authors intend to submit a more thorough document to a peer-reviewed journal. Contact Peter Gleckler (gleckler1@llnl.gov) for more information on the AMIP database or Dean Williams (williams13@llnl.gov) regarding the PCMDI Software System.
 

Table of Contents

Overview
Data transmission standards
Data ingest and organization
Quality control
Data distribution
Summary and future directions
 

Overview


We briefly summarize how the PCMDI Software System is used to manage AMIP. The tools and techniques established for AMIP serve as a test bed for other projects managed by PCMDI (e.g., CMIP and SMIP).

AMIP has been managed by PCMDI since 1990, when it was launched by the World Climate Research Programme. Until recently, handling the project data was difficult and inefficient. Two critical advancements have enabled the PCMDI team to develop an effective strategy to streamline the organization of terabyte-scale data provided by more than 30 climate modeling groups.

The first breakthrough was the realization by many in the community that data transmission standards can greatly increase the efficiency of collaborative research. Adherence to data transmission standards developed for AMIP2 has reduced the complexity of data received by PCMDI. More recently, community progress towards defining climate-specific "metadata" standards has enabled PCMDI to provide scientists with a variety of valuable information for research, and to make project data available in a form that is recognized by a variety of software tools.

The second critical advancement has been the development of the Climate Data Analysis Tool (CDAT), which has revolutionized PCMDI's capability to handle climate model data. CDAT takes full advantage of the interpreted, object-oriented scripting language Python, and serves as the foundation of the PCMDI Software System. Data organization, complex calculations and analysis, and visualization are all supported via CDAT. One by one, the many problems associated with the management of AMIP have been solved with CDAT. CDAT is also the tool of choice for research at PCMDI.

PCMDI's strategy for managing AMIP data consists of: 1) Reliance on data transmission standards; 2) The ingest and organization of project data; 3) Quality control testing; and 4) Data handling and distribution to researchers. Each of these system components is briefly summarized below; how they relate to AMIP research is depicted in red in this schematic.

1.  Data transmission standards for AMIP
 

As a result of the increased data complexity and scale for AMIP-II, it was clear to the PCMDI staff that there would be little chance of success without the establishment of some data transmission standards. Participating modeling centers handle their data very differently, and there was clearly no single preferred data format. The solution was a compromise between two common formats, GRIB and netCDF. To support this compromise, Bob Drach and Mike Fiorino developed the (Fortran and C) Library of AMIP data Transmission Standards (LATS), which enables centers to choose between the two formats and ensure that their data adhere to recognized standards. As of May 2000, over 25 modeling centers have successfully used LATS, and it is now being used for other projects such as SMIP. However, thanks to data handling advancements at PCMDI, LATS is no longer required for netCDF data, provided project participants adhere to community conventions (see the next section).


2. The ingest and organization of project data

Making use of the PCMDI Software System, Charles O'Connor has revolutionized PCMDI's ability to manage climate projects. For AMIP, LATS-generated data are filtered into GDT-compliant netCDF data with a standard file structure. (GDT is an extension of COARDS tailored for climate applications.) Charles' system has ensured that most problems encountered during the QC process can be easily corrected, and he has developed a versatile "file spanning" capability enabling users to link files in CDAT. What once took months of painstaking effort (for AMIP-I) can now be accomplished in less than a day, despite the enormous increase in data volume and complexity. Organization of AMIP data has proven so effective that a similar strategy is being adopted for other projects supported by PCMDI. As a result of this 'data ingest' capability, LATS will only be necessary for future simulations provided in GRIB. Project participants using netCDF will only need to ensure that their data are GDT or COARDS compliant and adhere to a few simple conventions for variable identification (to be described here soon).
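
As an illustration of the idea behind file spanning, the following Python sketch maps a position in one logical time series onto the file that holds it. It is not the CDAT implementation: the class SpannedSeries, its locate method, the (filename, first_step, n_steps) layout and the file names are all hypothetical, chosen only to show how several files can be presented as a single series.

```python
from bisect import bisect_right


class SpannedSeries:
    """Present several files, each holding a contiguous block of time
    steps, as one logical series (hypothetical sketch, not CDAT code)."""

    def __init__(self, segments):
        # segments: iterable of (filename, first_step, n_steps) tuples.
        # This layout is an assumption made for the example.
        self.segments = sorted(segments, key=lambda seg: seg[1])
        if not self.segments:
            raise ValueError("at least one file is required")
        # Each file must begin exactly where the previous one ends.
        for prev, cur in zip(self.segments, self.segments[1:]):
            if prev[1] + prev[2] != cur[1]:
                raise ValueError(
                    f"gap or overlap between {prev[0]} and {cur[0]}"
                )
        self.starts = [seg[1] for seg in self.segments]

    def __len__(self):
        # Total number of time steps across all files.
        return sum(n_steps for _, _, n_steps in self.segments)

    def locate(self, step):
        """Return (filename, local_index) for a global time step."""
        first = self.starts[0]
        if not first <= step < first + len(self):
            raise IndexError(f"time step {step} is outside the series")
        # Find the last file whose first step is <= the requested step.
        i = bisect_right(self.starts, step) - 1
        filename, start, _ = self.segments[i]
        return filename, step - start
```

With three yearly files of 12 monthly steps each, starting at steps 0, 12 and 24, locate(13) returns the second file and local index 1. The contiguity check in the constructor reflects one design choice: a gap between files is reported when the series is built rather than when a user first reads across it.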

3. The AMIP Quality Control System

The AMIP QC system consists of a suite of analysis codes and visualization utilities that enable rapid identification of problems. Karl Taylor has developed the primary quality control codes, which are currently being generalized and incorporated into CDAT. Charles Doutriaux has created GUIs (for example) driven by CDAT to facilitate viewing of QC statistics. QC tests are performed once incoming data have been ingested and organized, and problem data are either corrected or identified. Some example plots of frequently encountered problems: one and two.
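
The kind of test such a suite might include can be sketched in a few lines of Python. This is not one of the AMIP QC codes: the function names, the missing-value flag of 1.0e20 and the plausibility limits used in the example are assumptions made purely for illustration.

```python
import math


def summarize(values, missing=1.0e20):
    """Return basic statistics for a field, ignoring missing values.

    The missing-value flag is an assumed convention for this sketch.
    """
    valid = [v for v in values if v != missing and not math.isnan(v)]
    stats = {"n": len(valid), "n_missing": len(values) - len(valid)}
    if valid:
        stats["min"] = min(valid)
        stats["max"] = max(valid)
        stats["mean"] = sum(valid) / len(valid)
    else:
        stats["min"] = stats["max"] = stats["mean"] = None
    return stats


def check_range(stats, lower, upper):
    """List the ways the statistics fall outside [lower, upper]."""
    if stats["n"] == 0:
        return ["no valid data"]
    problems = []
    if stats["min"] < lower:
        problems.append(f"minimum {stats['min']} is below {lower}")
    if stats["max"] > upper:
        problems.append(f"maximum {stats['max']} is above {upper}")
    return problems
```

For example, a surface air temperature field expected in kelvin could be checked against illustrative limits of 150 and 350; a field supplied in degrees Celsius by mistake would then be reported because its minimum falls below the lower limit.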

4. Data handling and distribution

The central location of the AMIP database is a 1 TB allocation on PCMDI's RAID system. Data overflows are archived at NERSC. Data provided to approved AMIP diagnostic subprojects are compressed (with gzip) before being made available on FTP servers. Electronic distribution of data is the preferred method, but in a few cases this is not practical and data must be distributed by tape. Clearly, data distribution is an area where further work is needed.
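
The compression step can be sketched with Python's standard gzip module. The helper names gzip_file and matches_original are hypothetical, and the check that the compressed copy decompresses to the original bytes is an addition for this example, not a documented part of PCMDI's procedure.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def gzip_file(source):
    """Write a gzip-compressed copy of source next to it; return its path."""
    source = Path(source)
    target = source.with_name(source.name + ".gz")
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        # Stream the data so large files are never held fully in memory.
        shutil.copyfileobj(src, dst)
    return target


def _digest(stream, chunk_size=1 << 20):
    """SHA-256 of everything readable from stream, read in chunks."""
    sha = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        sha.update(chunk)
    return sha.hexdigest()


def matches_original(source, compressed):
    """True if the compressed file decompresses to the original bytes."""
    with open(source, "rb") as src, gzip.open(compressed, "rb") as gz:
        return _digest(src) == _digest(gz)
```

A file would be placed on the server only after matches_original returns True for it, so that a truncated or corrupted archive is caught before a subproject downloads it.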

5. Summary and future directions

After years of infrastructure development, efficient management of AMIP data has been realized, providing a test bed for PCMDI's management of other projects. Many PCMDI scientists have contributed to this effort, which has been coordinated by Peter Gleckler. The tools developed by the PCMDI computer science team (Paul Dubois, Bob Drach, Charlie O'Connor and Dean Williams) form the essence of PCMDI's data management. Community data conventions help ensure that PCMDI provides high-quality data products for community research. Additional efforts to further streamline the system (e.g., for data distribution) are under way and will be described here as they develop.