Human actions are driving changes in Earth’s atmosphere, ocean, and land surface at unprecedented rates. Fully coupled Earth system models (ESMs) simulate physical aspects of the climate system, their interactions with terrestrial and marine ecosystems, and biogeochemical cycles. ESMs are therefore extremely valuable for understanding and managing planetary-scale human-environment interactions. Over the past few years, modeling centers around the world have prepared their state-of-the-art ESMs to participate in the sixth phase of the Coupled Model Intercomparison Project (CMIP6). The World Climate Research Programme (WCRP) coordinates the design and distribution of ESM simulations, aiming to assess the Earth system response to various forcing trajectories (inclusive of internal climate variability and uncertainty estimates) and to understand the origins and consequences of systematic model biases.
CMIP represents a cutting-edge, collective capacity to understand and model the Earth system. The CMIP enterprise had modest beginnings in the early 1990s, with a handful of participating models each archiving a limited set of output from their simulations. It has grown into the present CMIP6, in which dozens of models and modeling centers contribute many more variables at higher spatial resolution and temporal frequency. These factors have driven a massive increase in the volume and diversity of CMIP data, making the data increasingly difficult to analyze effectively. Meanwhile, the potential for connectivity among scientists has increased as collaborative platforms have emerged. Even so, the scientific culture could become more effective at developing community-oriented approaches to the large-scale problems that define the present era, leveraging resources like CMIP to advance scientific understanding and produce actionable information. The complexity of the Earth system, and of our models of it, demands deep collaboration that integrates knowledge across diverse communities. Earth system science is data intensive; barriers to the effective synthesis and interpretation of large and diverse data form a critical rate-limiting step in establishing actionable science.
It was in this context that we organized the CMIP6 Hackathon, a joint OCB/US-CLIVAR activity. The Hackathon took place 16–18 October 2019 at three locations simultaneously: Lamont-Doherty Earth Observatory (LDEO; Palisades, NY), NCAR (Boulder, CO), and, informally, the University of Washington (UW; Seattle, WA). In addition to these primary nodes, a handful of people participated remotely, some having expressed reluctance to travel because of personal carbon-conscious no-fly commitments. Participants were primarily graduate students and postdocs at U.S. universities. We structured the CMIP6 Hackathon around group projects. There was very little direct instruction; instead, project teams worked steadily for three days, with regular check-ins that established a sense of shared purpose. The intense collaboration within the project teams fostered rapid learning, as group members learned from each other while overcoming technical and scientific challenges.
We invested substantial effort in advance planning of the group projects and staging of the relevant CMIP6 data. Prior to the event, we provided instructions and made use of collaborative platforms (Slack, Discourse, Google Drive, GitHub). Participants were asked to propose projects via a post on discourse.pangeo.io, and the ensuing discussion enabled the formation of teams. Project themes spanned many sub-disciplines, including ocean physics and biogeochemistry, climate sensitivity, atmospheric circulation, and the efficacy of data compression techniques. We pre-staged CMIP6 data for the projects in two locations: (1) the National Center for Atmospheric Research’s CMIP Analysis Platform (CMIP-AP), which provides a 10 PB central data repository connected to the NCAR supercomputer, Cheyenne; and (2) the Google Cloud, taking advantage of efforts within the Pangeo Project to coordinate hosting of CMIP6 data under Google’s Public Dataset Program. To facilitate data access within computational workflows, we developed a data cataloging utility, Intake-esm, a Python package that supports data discovery and access. The hackathon focused on the application of a suite of software tools that have been a focal point of the Pangeo community: Xarray, which enables working with labeled, multi-dimensional arrays; Dask, which provides a flexible approach to parallel computation; and Jupyter Notebooks, which provide an interactive computing platform that supports sharing and publication of computational narratives. Each project team worked in a GitHub repository and developed a set of Jupyter Notebooks that perform and document the computations related to their research objectives.
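To give a flavor of what working with labeled, multi-dimensional arrays looks like, the following is a minimal sketch of an area-weighted global mean computed with Xarray. It is not taken from any hackathon project: the field is synthetic random data standing in for a CMIP6 variable, and the variable name `tas`, the grid spacing, and the values are all assumptions made for illustration. In a real workflow the array would come from the data catalog and would typically be backed by Dask for parallel computation, which is left out here.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a model output field (values are made up, not CMIP6 data).
time = np.arange(12)                   # 12 hypothetical monthly time steps
lat = np.linspace(-87.5, 87.5, 36)     # assumed 5-degree grid, cell centers
lon = np.linspace(2.5, 357.5, 72)
rng = np.random.default_rng(0)
data = 288.0 + rng.normal(scale=1.0, size=(time.size, lat.size, lon.size))

# A labeled array: dimensions and coordinates travel with the data.
tas = xr.DataArray(
    data,
    dims=("time", "lat", "lon"),
    coords={"time": time, "lat": lat, "lon": lon},
    name="tas",  # hypothetical variable name for this sketch
    attrs={"units": "K"},
)

# Grid cells shrink toward the poles, so weight each latitude by cos(latitude).
weights = np.cos(np.deg2rad(tas.lat))
weights.name = "weights"

# Reduce over the named spatial dimensions; the time dimension is preserved.
global_mean = tas.weighted(weights).mean(dim=("lat", "lon"))
```

Because the reduction refers to dimensions by name rather than by axis position, the same lines would apply unchanged to arrays with a different dimension order or additional dimensions.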
The CMIP6 Hackathon represented one small step toward an envisioned transformation: building communities of practice around specific problems and using effective analysis workflows to stimulate interdisciplinary collaboration and accelerate science.