Postprocessing

The postprocessing module provides functions to combine CMIP6 datasets (provided in a dictionary) based on their dataset attributes. We provide a general, highly customizable function, combine_datasets, but also simpler wrappers for frequently used functionality like merging variables, concatenating members, and even parsing metric coordinates (given in a separate dictionary of datasets).

This functionality overlaps to a degree with the approach taken in intake-esm, but provides more flexibility to predetermine the order of operations and, crucially, allows custom functions to be applied at every step.

You can still use intake-esm, but for this functionality to work, please specify aggregate=False as shown below.

[10]:
# let's start by loading a few example datasets
from xmip.utils import google_cmip_col
from xmip.preprocessing import combined_preprocessing

col = google_cmip_col()
experiment_id='historical'
source_id = ['CanESM5-CanOE', 'GFDL-ESM4']
kwargs = {
    'zarr_kwargs':{
        'consolidated':True,
        'use_cftime':True
    },
    'aggregate':False,
    'preprocess':combined_preprocessing
}

cat_data = col.search(
    source_id=source_id,
    experiment_id=experiment_id,
    grid_label='gn',
    table_id='Omon',
    variable_id=['tos', 'zos']
)
ddict = cat_data.to_dataset_dict(**kwargs)
list(ddict.keys())

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.zstore.dcpp_init_year.version'
100.00% [12/12 00:01<00:00]
[10]:
['CMIP.CCCma.CanESM5-CanOE.historical.r2i1p2f1.Omon.tos.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r2i1p2f1/Omon/tos/gn/v20190429/.nan.20190429',
 'CMIP.CCCma.CanESM5-CanOE.historical.r1i1p2f1.Omon.zos.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r1i1p2f1/Omon/zos/gn/v20190429/.nan.20190429',
 'CMIP.CCCma.CanESM5-CanOE.historical.r3i1p2f1.Omon.tos.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r3i1p2f1/Omon/tos/gn/v20190429/.nan.20190429',
 'CMIP.CCCma.CanESM5-CanOE.historical.r3i1p2f1.Omon.zos.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r3i1p2f1/Omon/zos/gn/v20190429/.nan.20190429',
 'CMIP.NOAA-GFDL.GFDL-ESM4.historical.r3i1p1f1.Omon.zos.gn.gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r3i1p1f1/Omon/zos/gn/v20180701/.nan.20180701',
 'CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.zos.gn.gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Omon/zos/gn/v20190726/.nan.20190726',
 'CMIP.NOAA-GFDL.GFDL-ESM4.historical.r2i1p1f1.Omon.tos.gn.gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r2i1p1f1/Omon/tos/gn/v20180701/.nan.20180701',
 'CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.tos.gn.gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Omon/tos/gn/v20190726/.nan.20190726',
 'CMIP.CCCma.CanESM5-CanOE.historical.r2i1p2f1.Omon.zos.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r2i1p2f1/Omon/zos/gn/v20190429/.nan.20190429',
 'CMIP.CCCma.CanESM5-CanOE.historical.r1i1p2f1.Omon.tos.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r1i1p2f1/Omon/tos/gn/v20190429/.nan.20190429',
 'CMIP.NOAA-GFDL.GFDL-ESM4.historical.r3i1p1f1.Omon.tos.gn.gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r3i1p1f1/Omon/tos/gn/v20180701/.nan.20180701',
 'CMIP.NOAA-GFDL.GFDL-ESM4.historical.r2i1p1f1.Omon.zos.gn.gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r2i1p1f1/Omon/zos/gn/v20180701/.nan.20180701']

You can see that the dictionary contains data from different models, ensemble members, and two variables: zos (sea surface height) and tos (sea surface temperature).

Let's go through a couple of steps to reduce these to two model datasets (each model should end up as a single xarray dataset with both variables and concatenated members).

Merging variables

First, let's merge the variables (currently one dataset each) into a single dataset per member.

[11]:
from xmip.postprocessing import merge_variables

ddict_merged = merge_variables(ddict)
list(ddict_merged.keys())
[11]:
['CanESM5-CanOE.gn.historical.Omon.r2i1p2f1',
 'CanESM5-CanOE.gn.historical.Omon.r1i1p2f1',
 'CanESM5-CanOE.gn.historical.Omon.r3i1p2f1',
 'GFDL-ESM4.gn.historical.Omon.r3i1p1f1',
 'GFDL-ESM4.gn.historical.Omon.r1i1p1f1',
 'GFDL-ESM4.gn.historical.Omon.r2i1p1f1']

The keys have been replaced by an identifier based on the unique attributes that differentiate the datasets; the variable name is no longer part of the key.
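For illustration, such a key is simply the remaining distinguishing attribute values joined with dots. A minimal pure-Python sketch of that idea (the helper name match_key and the exact attribute order are inferred from the keys above, not xmip internals):

```python
def match_key(attrs, match_attrs=('source_id', 'grid_label', 'experiment_id',
                                  'table_id', 'variant_label')):
    # join the attribute values in order, mirroring the keys shown above
    return '.'.join(str(attrs[a]) for a in match_attrs)

attrs = {'source_id': 'GFDL-ESM4', 'grid_label': 'gn',
         'experiment_id': 'historical', 'table_id': 'Omon',
         'variant_label': 'r2i1p1f1'}
match_key(attrs)  # → 'GFDL-ESM4.gn.historical.Omon.r2i1p1f1'
```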

[12]:
# check if the merging worked
ddict_merged['GFDL-ESM4.gn.historical.Omon.r2i1p1f1']
[12]:
<xarray.Dataset>
Dimensions:        (time: 1980, y: 576, x: 720, vertex: 4, bnds: 2)
Coordinates:
  * x              (x) float64 -299.8 -299.2 -298.8 -298.2 ... 58.75 59.25 59.75
  * y              (y) float64 -77.91 -77.72 -77.54 -77.36 ... 89.47 89.68 89.89
    lat            (y, x) float32 dask.array<chunksize=(576, 720), meta=np.ndarray>
    lat_verticies  (y, x, vertex) float32 dask.array<chunksize=(576, 720, 4), meta=np.ndarray>
    lon            (y, x) float32 dask.array<chunksize=(576, 720), meta=np.ndarray>
    lon_verticies  (y, x, vertex) float32 dask.array<chunksize=(576, 720, 4), meta=np.ndarray>
  * time           (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bounds    (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    lon_bounds     (bnds, y, x) float32 dask.array<chunksize=(1, 576, 720), meta=np.ndarray>
    lat_bounds     (bnds, y, x) float32 dask.array<chunksize=(1, 576, 720), meta=np.ndarray>
Dimensions without coordinates: vertex, bnds
Data variables:
    tos            (time, y, x) float32 dask.array<chunksize=(64, 576, 720), meta=np.ndarray>
    zos            (time, y, x) float32 dask.array<chunksize=(61, 576, 720), meta=np.ndarray>
Attributes: (12/47)
    Conventions:            CF-1.7 CMIP-6.0 UGRID-1.0
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  54750.0
    comment:                <null ref>
    ...                     ...
    title:                  NOAA GFDL GFDL-ESM4 model output prepared for CMI...
    variant_info:           N/A
    variant_label:          r2i1p1f1
    version_id:             v20180701
    intake_esm_varname:     None
    original_key:           CMIP.NOAA-GFDL.GFDL-ESM4.historical.r2i1p1f1.Omon...

You can see that both variables are now contained in a single dataset.

Let's keep going. Since these members all have the same length (this might not be true for all CMIP6 models!), we can simply concatenate them along a new dimension:

[14]:
from xmip.postprocessing import concat_members

ddict_concat = concat_members(ddict_merged)
print(list(ddict_concat.keys()))
ddict_concat['GFDL-ESM4.gn.historical.Omon']
/home/jovyan/2_AREAS/xmip/xmip/postprocessing.py:113: UserWarning: Match attributes ['variable_id'] not found in any of the datasets.         This can happen when several combination functions are used.
  warnings.warn(
['CanESM5-CanOE.gn.historical.Omon', 'GFDL-ESM4.gn.historical.Omon']
[14]:
<xarray.Dataset>
Dimensions:        (member_id: 3, time: 1980, y: 576, x: 720, vertex: 4, bnds: 2)
Coordinates:
  * x              (x) float64 -299.8 -299.2 -298.8 -298.2 ... 58.75 59.25 59.75
  * y              (y) float64 -77.91 -77.72 -77.54 -77.36 ... 89.47 89.68 89.89
    lat            (y, x) float32 -77.91 -77.91 -77.91 ... 65.39 65.18 64.97
    lat_verticies  (y, x, vertex) float32 -78.0 -77.82 -77.82 ... 64.87 64.87
    lon            (y, x) float32 60.25 60.75 61.25 61.75 ... 59.99 59.99 60.0
    lon_verticies  (y, x, vertex) float32 60.0 60.0 60.5 60.5 ... 60.0 60.0 60.0
  * time           (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bounds    (time, bnds) object 1850-01-01 00:00:00 ... 2015-01-01 00:...
    lon_bounds     (bnds, y, x) float32 60.0 60.5 61.0 61.5 ... 59.99 60.0 60.0
    lat_bounds     (bnds, y, x) float32 -78.0 -78.0 -78.0 ... 65.39 65.18 64.97
  * member_id      (member_id) <U8 'r3i1p1f1' 'r1i1p1f1' 'r2i1p1f1'
Dimensions without coordinates: vertex, bnds
Data variables:
    zos            (member_id, time, y, x) float32 dask.array<chunksize=(1, 61, 576, 720), meta=np.ndarray>
    tos            (member_id, time, y, x) float32 dask.array<chunksize=(1, 64, 576, 720), meta=np.ndarray>
Attributes: (12/39)
    Conventions:           CF-1.7 CMIP-6.0 UGRID-1.0
    activity_id:           CMIP
    branch_method:         standard
    branch_time_in_child:  0.0
    comment:               <null ref>
    contact:               gfdl.climate.model.info@noaa.gov
    ...                    ...
    sub_experiment:        none
    sub_experiment_id:     none
    table_id:              Omon
    title:                 NOAA GFDL GFDL-ESM4 model output prepared for CMIP...
    variant_info:          N/A
    intake_esm_varname:    None

Now we have only one dataset per model and experiment, enabling us to easily apply an analysis that depends on both variables. For demonstration purposes, here we plot the average SST over all cells where the SSH is below 0.

[15]:
import matplotlib.pyplot as plt
plt.figure(figsize=[8,4])
for i, (name, ds) in enumerate(ddict_concat.items()):
    data = ds.tos.where(ds.zos<0).mean(['x','y'])
    plt.subplot(2,1,i+1)
    data.coarsen(time=12*5).mean().plot(hue='member_id')
    plt.gca().set_title(ds.attrs['source_id'])
_images/postprocessing_9_0.png

Alternatively, you can choose just a single member from each model using pick_first_member. This can be very helpful for prototyping an analysis, since some models have a large number of members, which can increase computation time significantly!

[18]:
from xmip.postprocessing import pick_first_member
ddict_single_member = pick_first_member(ddict_merged)
print(list(ddict_single_member.keys()))
ddict_single_member['GFDL-ESM4.gn.historical.Omon']
['CanESM5-CanOE.gn.historical.Omon', 'GFDL-ESM4.gn.historical.Omon']
[18]:
<xarray.Dataset>
Dimensions:        (time: 1980, y: 576, x: 720, vertex: 4, bnds: 2)
Coordinates:
  * x              (x) float64 -299.8 -299.2 -298.8 -298.2 ... 58.75 59.25 59.75
  * y              (y) float64 -77.91 -77.72 -77.54 -77.36 ... 89.47 89.68 89.89
    lat            (y, x) float32 dask.array<chunksize=(576, 720), meta=np.ndarray>
    lat_verticies  (y, x, vertex) float32 dask.array<chunksize=(576, 720, 4), meta=np.ndarray>
    lon            (y, x) float32 dask.array<chunksize=(576, 720), meta=np.ndarray>
    lon_verticies  (y, x, vertex) float32 dask.array<chunksize=(576, 720, 4), meta=np.ndarray>
  * time           (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bounds    (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    lon_bounds     (bnds, y, x) float32 dask.array<chunksize=(1, 576, 720), meta=np.ndarray>
    lat_bounds     (bnds, y, x) float32 dask.array<chunksize=(1, 576, 720), meta=np.ndarray>
Dimensions without coordinates: vertex, bnds
Data variables:
    zos            (time, y, x) float32 dask.array<chunksize=(61, 576, 720), meta=np.ndarray>
    tos            (time, y, x) float32 dask.array<chunksize=(120, 576, 720), meta=np.ndarray>
Attributes: (12/46)
    Conventions:            CF-1.7 CMIP-6.0 UGRID-1.0
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  36500.0
    comment:                <null ref>
    ...                     ...
    title:                  NOAA GFDL GFDL-ESM4 model output prepared for CMI...
    variant_info:           N/A
    variant_label:          r1i1p1f1
    version_id:             v20190726
    intake_esm_varname:     None
    original_key:           GFDL-ESM4.gn.historical.Omon.r1i1p1f1

Custom combination functions

There might be cases where the convenience wrappers are not flexible enough. In that case, you can always use combine_datasets with a custom function, which takes a list of 'matched' xarray.Dataset objects as input. The matching is done by providing the attributes (match_attrs) that must be identical across all matched datasets.

A common use case could be to pick just any single member from a model. This helps in developing analyses for many models without having to compute all of them at the initial stage.

[15]:
from xmip.postprocessing import combine_datasets

def pick_first_member(ds_list, **kwargs):
    return ds_list[0]

ddict_new = combine_datasets(
    ddict_merged,
    pick_first_member,
    match_attrs=['source_id', 'grid_label', 'experiment_id', 'table_id']
)
ddict_new['CanESM5-CanOE.gn.historical.Omon']
[15]:
<xarray.Dataset>
Dimensions:        (bnds: 2, time: 1980, vertex: 4, x: 360, y: 291)
Coordinates:
  * x              (x) int32 0 1 2 3 4 5 6 7 ... 352 353 354 355 356 357 358 359
  * y              (y) int32 0 1 2 3 4 5 6 7 ... 283 284 285 286 287 288 289 290
    lat            (y, x) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
    lon            (y, x) float64 dask.array<chunksize=(291, 360), meta=np.ndarray>
  * time           (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bounds    (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    lat_verticies  (y, x, vertex) float64 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
    lon_verticies  (y, x, vertex) float64 dask.array<chunksize=(291, 360, 4), meta=np.ndarray>
  * bnds           (bnds) int64 0 1
  * vertex         (vertex) int64 0 1 2 3
    lon_bounds     (bnds, y, x) float64 dask.array<chunksize=(1, 291, 360), meta=np.ndarray>
    lat_bounds     (bnds, y, x) float64 dask.array<chunksize=(1, 291, 360), meta=np.ndarray>
Data variables:
    tos            (time, y, x) float32 dask.array<chunksize=(215, 291, 360), meta=np.ndarray>
    zos            (time, y, x) float32 dask.array<chunksize=(208, 291, 360), meta=np.ndarray>
Attributes: (12/52)
    CCCma_model_hash:            932b659de600c6a0e94f619abaf9cc79eabcd337
    CCCma_parent_runid:          canoecpl-007
    CCCma_pycmor_hash:           3ecdc18eb7c1f7fbce0346850f41adf815d9fb66
    CCCma_runid:                 c2-his02
    Conventions:                 CF-1.7 CMIP-6.2
    YMDH_branch_time_in_child:   1850:01:01:00
    ...                          ...
    title:                       CanESM5-CanOE output prepared for CMIP6
    variant_label:               r2i1p2f1
    version:                     v20190429
    version_id:                  v20190429
    intake_esm_varname:          None
    original_key:                CMIP.CCCma.CanESM5-CanOE.historical.r2i1p2f1...

Cool, now we have a dictionary where each entry only contains one member. This approach can be used to implement a variety of custom functions on matched datasets.
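Conceptually, the matching boils down to grouping datasets whose match_attrs values agree and handing each group to the combination function. A pure-Python sketch of that grouping logic (an illustration, not the actual xmip implementation; real entries are xarray.Dataset objects with an .attrs dictionary, stood in for here by plain dicts):

```python
from collections import defaultdict

def combine_datasets_sketch(ddict, combine_func, match_attrs):
    # group datasets whose match_attrs values agree...
    groups = defaultdict(list)
    for ds in ddict.values():
        key = '.'.join(str(ds['attrs'][a]) for a in match_attrs)
        groups[key].append(ds)
    # ...then reduce each group with the user-supplied function
    return {key: combine_func(ds_list) for key, ds_list in groups.items()}

ddict = {
    'a': {'attrs': {'source_id': 'GFDL-ESM4', 'variant_label': 'r1i1p1f1'}},
    'b': {'attrs': {'source_id': 'GFDL-ESM4', 'variant_label': 'r2i1p1f1'}},
    'c': {'attrs': {'source_id': 'CanESM5-CanOE', 'variant_label': 'r1i1p2f1'}},
}
picked = combine_datasets_sketch(ddict, lambda ds_list: ds_list[0], ['source_id'])
sorted(picked)  # → ['CanESM5-CanOE', 'GFDL-ESM4']
```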

Combining different grids via interpolation

Not all CMIP outputs for a given model are provided on the same grid structure. Especially in the ocean, the native grid (gn) can be quite complex, and some variables are regridded to regular lon/lat grids for convenience or to save storage.

[2]:
col = google_cmip_col()
cat = col.search(
    variable_id=["thetao", "o2"],
    experiment_id=["historical"],
    source_id=["GFDL-ESM4", "CanESM5-CanOE"],
    table_id=["Omon"],
)
ddict = cat.to_dataset_dict(
    zarr_kwargs={"consolidated": True, "use_cftime": True},
    aggregate=False,
    preprocess=combined_preprocessing,
)

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.zstore.dcpp_init_year.version'
100.00% [9/9 00:01<00:00]
[3]:
list(ddict.keys())
[3]:
['CMIP.CCCma.CanESM5-CanOE.historical.r3i1p2f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r3i1p2f1/Omon/thetao/gn/v20190429/.nan.20190429',
 'CMIP.CCCma.CanESM5-CanOE.historical.r2i1p2f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r2i1p2f1/Omon/thetao/gn/v20190429/.nan.20190429',
 'CMIP.CCCma.CanESM5-CanOE.historical.r3i1p2f1.Omon.o2.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r3i1p2f1/Omon/o2/gn/v20190429/.nan.20190429',
 'CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.thetao.gr.gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Omon/thetao/gr/v20190726/.nan.20190726',
 'CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Omon/thetao/gn/v20190726/.nan.20190726',
 'CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.o2.gr.gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Omon/o2/gr/v20190726/.nan.20190726',
 'CMIP.CCCma.CanESM5-CanOE.historical.r2i1p2f1.Omon.o2.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r2i1p2f1/Omon/o2/gn/v20190429/.nan.20190429',
 'CMIP.CCCma.CanESM5-CanOE.historical.r1i1p2f1.Omon.o2.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r1i1p2f1/Omon/o2/gn/v20190429/.nan.20190429',
 'CMIP.CCCma.CanESM5-CanOE.historical.r1i1p2f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/CCCma/CanESM5-CanOE/historical/r1i1p2f1/Omon/thetao/gn/v20190429/.nan.20190429']

For the CanESM5-CanOE model, all variables are on the native gn grid, so we could combine them using the logic from above. The GFDL-ESM4 model, however, provides oxygen o2 only on a regridded gr grid, while potential temperature thetao is available on both grids. If we are fine with the lower-resolution regridded output (gr), we could simply filter for it during the search, but there are cases where one might want to preserve as much detail as possible, or where another variable is only available on the native grid. In those cases we can use xESMF to interpolate from one grid to another.

xmip provides a convenient wrapper which will interpolate datasets onto the desired grid if necessary:

[5]:
from xmip.postprocessing import (
    interpolate_grid_label
)
[11]:
combined_grids_dict = interpolate_grid_label(ddict, target_grid_label='gn')
/srv/conda/envs/notebook/lib/python3.8/site-packages/xesmf/frontend.py:496: FutureWarning: ``output_sizes`` should be given in the ``dask_gufunc_kwargs`` parameter. It will be removed as direct parameter in a future version.
  ds_out = xr.apply_ufunc(
[8]:
list(combined_grids_dict.keys())
[8]:
['CanESM5-CanOE.historical.Omon.r3i1p2f1',
 'CanESM5-CanOE.historical.Omon.r2i1p2f1',
 'GFDL-ESM4.historical.Omon.r1i1p1f1',
 'CanESM5-CanOE.historical.Omon.r1i1p2f1']

OK, we can see that the variables have apparently been combined for each model member. Let's take a closer look:

[9]:
combined_grids_dict['GFDL-ESM4.historical.Omon.r1i1p1f1']
[9]:
<xarray.Dataset>
Dimensions:        (bnds: 3, lev: 35, time: 1980, vertex: 4, x: 720, y: 576)
Coordinates: (12/14)
  * bnds           (bnds) float64 0.0 1.0 2.0
  * x              (x) float64 -299.8 -299.2 -298.8 -298.2 ... 58.75 59.25 59.75
  * y              (y) float64 -77.91 -77.72 -77.54 -77.36 ... 89.47 89.68 89.89
    lat            (y, x) float32 dask.array<chunksize=(576, 720), meta=np.ndarray>
    lat_verticies  (y, x, vertex) float32 dask.array<chunksize=(576, 720, 4), meta=np.ndarray>
  * lev            (lev) float64 2.5 10.0 20.0 32.5 ... 5.5e+03 6e+03 6.5e+03
    ...             ...
    lon_verticies  (y, x, vertex) float32 dask.array<chunksize=(576, 720, 4), meta=np.ndarray>
  * time           (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bounds    (time, bnds) object dask.array<chunksize=(1980, 3), meta=np.ndarray>
  * vertex         (vertex) int64 0 1 2 3
    lon_bounds     (bnds, y, x) float32 dask.array<chunksize=(2, 576, 720), meta=np.ndarray>
    lat_bounds     (bnds, y, x) float32 dask.array<chunksize=(2, 576, 720), meta=np.ndarray>
Data variables:
    thetao         (time, lev, y, x) float32 dask.array<chunksize=(4, 35, 576, 720), meta=np.ndarray>
    o2             (time, lev, y, x) float64 dask.array<chunksize=(13, 35, 576, 720), meta=np.ndarray>
Attributes: (12/41)
    Conventions:            CF-1.7 CMIP-6.0 UGRID-1.0
    activity_id:            CMIP
    branch_method:          standard
    branch_time_in_child:   0.0
    branch_time_in_parent:  36500.0
    comment:                <null ref>
    ...                     ...
    title:                  NOAA GFDL GFDL-ESM4 model output prepared for CMI...
    variant_info:           N/A
    variant_label:          r1i1p1f1
    version_id:             v20190726
    intake_esm_varname:     None
    regrid_method:          bilinear

Both data variables have the same shape, and we can confirm that these match the high-resolution native-grid output.

[10]:
ddict['CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/NOAA-GFDL/GFDL-ESM4/historical/r1i1p1f1/Omon/thetao/gn/v20190726/.nan.20190726']
[10]:
<xarray.Dataset>
Dimensions:        (bnds: 2, lev: 35, time: 1980, vertex: 4, x: 720, y: 576)
Coordinates: (12/14)
  * bnds           (bnds) float64 1.0 2.0
    lat            (y, x) float32 dask.array<chunksize=(576, 720), meta=np.ndarray>
    lat_verticies  (y, x, vertex) float32 dask.array<chunksize=(576, 720, 4), meta=np.ndarray>
  * lev            (lev) float64 2.5 10.0 20.0 32.5 ... 5.5e+03 6e+03 6.5e+03
    lev_bounds     (lev, bnds) float64 dask.array<chunksize=(35, 2), meta=np.ndarray>
    lon            (y, x) float32 dask.array<chunksize=(576, 720), meta=np.ndarray>
    ...             ...
    time_bounds    (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * x              (x) float64 -299.8 -299.2 -298.8 -298.2 ... 58.75 59.25 59.75
  * y              (y) float64 -77.91 -77.72 -77.54 -77.36 ... 89.47 89.68 89.89
  * vertex         (vertex) int64 0 1 2 3
    lon_bounds     (bnds, y, x) float32 dask.array<chunksize=(1, 576, 720), meta=np.ndarray>
    lat_bounds     (bnds, y, x) float32 dask.array<chunksize=(1, 576, 720), meta=np.ndarray>
Data variables:
    thetao         (time, lev, y, x) float32 dask.array<chunksize=(4, 35, 576, 720), meta=np.ndarray>
Attributes: (12/52)
    Conventions:             CF-1.7 CMIP-6.0 UGRID-1.0
    activity_id:             CMIP
    branch_method:           standard
    branch_time_in_child:    0.0
    branch_time_in_parent:   36500.0
    comment:                 <null ref>
    ...                      ...
    status:                  2019-10-25;created;by nhn2@columbia.edu
    netcdf_tracking_ids:     hdl:21.14100/6e88fbac-ee16-434e-88b4-92bd18c5300...
    version_id:              v20190726
    intake_esm_varname:      None
    intake_esm_dataset_key:  CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omo...
    original_key:            CMIP.NOAA-GFDL.GFDL-ESM4.historical.r1i1p1f1.Omo...

As a final check, let's plot some example maps:

[13]:
import matplotlib.pyplot as plt

for name, ds in combined_grids_dict.items():
    ds = ds.isel(lev=0, time=0)
    plt.figure(figsize=[10,12])
    plt.subplot(2,1,1)
    ds.thetao.plot()
    plt.title(name+' thetao')
    plt.subplot(2,1,2)
    ds.o2.plot()
    plt.title(name+' o2')
    plt.show()
CanESM5-CanOE.historical.Omon.r3i1p2f1
[########################################] | 100% Completed |  0.9s
[########################################] | 100% Completed |  0.7s
_images/postprocessing_27_1.png
CanESM5-CanOE.historical.Omon.r2i1p2f1
[########################################] | 100% Completed |  0.5s
[########################################] | 100% Completed |  0.8s
_images/postprocessing_27_3.png
GFDL-ESM4.historical.Omon.r1i1p1f1
[########################################] | 100% Completed |  1.3s
[########################################] | 100% Completed |  4.9s
_images/postprocessing_27_5.png
CanESM5-CanOE.historical.Omon.r1i1p2f1
[########################################] | 100% Completed |  0.7s
[########################################] | 100% Completed |  0.8s
_images/postprocessing_27_7.png

This looks pretty good! Some notes on the resulting data for further processing:

  • The regridding can be computationally expensive, especially for higher-resolution models. If you run into trouble, consider rechunking along a non-horizontal dimension, e.g. depth or time.

  • As of now, we are using the xESMF defaults, which can result in inconsistent positions of missing values between variables (check out the coastlines in the GFDL o2 vs thetao above). Make sure to take this into account if you are computing anything that incorporates values near the coastline or globally integrated quantities!
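One way to guard against this is to restrict every variable to the cells where all variables have data before computing integrated quantities. A minimal numpy sketch of the idea (with xarray you would express the same thing via .notnull() and .where()):

```python
import numpy as np

def common_mask(*fields):
    # True only where every field has valid (non-NaN) data
    mask = np.ones(fields[0].shape, dtype=bool)
    for field in fields:
        mask &= ~np.isnan(field)
    return mask

# toy 2x2 fields with mismatched missing values near the 'coast'
thetao = np.array([[10.0, np.nan], [12.0, 14.0]])
o2 = np.array([[0.2, 0.3], [np.nan, 0.25]])

mask = common_mask(thetao, o2)
# apply the shared mask so both variables cover identical cells
thetao_masked = np.where(mask, thetao, np.nan)
o2_masked = np.where(mask, o2, np.nan)
```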

Handling grid metrics in CMIP6

Using the same logic as before, we can start to match grid metrics to model output.

Almost any analysis of model data involves grid metrics like cell area or cell volume. These are essential for computing, e.g., global averages, since cells in ocean models usually differ in size.
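The underlying computation is an area-weighted mean: multiply each cell value by its area, sum, and divide by the total area of valid cells. A minimal numpy sketch (later in this section we do the equivalent with xarray's .weighted()):

```python
import numpy as np

def area_weighted_mean(field, area):
    # zero out the weights of NaN (e.g. land) cells so they don't bias the result
    weights = np.where(np.isnan(field), 0.0, area)
    return np.nansum(field * weights) / weights.sum()

# toy example: polar cells (row 0) half the area of equatorial cells (row 1)
sst = np.array([[10.0, 20.0], [np.nan, 30.0]])
area = np.array([[1.0, 1.0], [2.0, 2.0]])
area_weighted_mean(sst, area)  # → (10 + 20 + 60) / 4 = 22.5
```

Note that a naive unweighted mean of the same field would give 20.0, illustrating why the metric matters.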

Within the CMIP6 archive, grid metrics are stored as separate variables like areacello (horizontal cell area) or thkcello (vertical cell thickness), and it can be rather cumbersome to match these with the data. xmip has functions to make this easy, which also offer options to use metrics from, e.g., another experiment in case they are not available (more on that below).

[1]:
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['figure.figsize'] = 12, 6
%config InlineBackend.figure_format = 'retina'

Let's start by loading a bunch of ocean temperature output into a dictionary of datasets:

[7]:
from xmip.utils import google_cmip_col
from xmip.preprocessing import combined_preprocessing

col = google_cmip_col()
experiment_id='historical'
source_id = 'MPI-ESM1-2-LR'
kwargs = {
    'zarr_kwargs': {'consolidated': True, 'use_cftime': True},
    'aggregate': False,
    'preprocess': combined_preprocessing
}

cat_data = col.search(source_id=source_id, experiment_id=experiment_id, variable_id='tos')
ddict = cat_data.to_dataset_dict(**kwargs)

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.zstore.dcpp_init_year.version'
100.00% [20/20 00:08<00:00]

Note the aggregate=False option, which prevents the concatenation of model members. I have found that this concatenation can often lead to dask performance problems, and I currently do not recommend it in my workflow.

Now let's repeat the step, but switch variable_id to a metric of interest, like the cell area.

[8]:
cat_metric = col.search(source_id=source_id, experiment_id=experiment_id, variable_id='areacello')
ddict_metrics = cat_metric.to_dataset_dict(**kwargs)

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.zstore.dcpp_init_year.version'
100.00% [10/10 00:00<00:00]

You can see that datasets were loaded for both the actual data and the metrics. Let's match these up using match_metrics.

[9]:
from xmip.postprocessing import match_metrics
ddict_matched = match_metrics(ddict, ddict_metrics, ['areacello'])

You can convince yourself that all the datasets in ddict_matched now have a coordinate called areacello:

[12]:
ddict_matched['CMIP.MPI-M.MPI-ESM1-2-LR.historical.r1i1p1f1.Oday.tos.gn.gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r1i1p1f1/Oday/tos/gn/v20190710/.nan.20190710']
[12]:
<xarray.Dataset>
Dimensions:        (bnds: 2, time: 60265, vertex: 4, x: 256, y: 220)
Coordinates:
  * x              (x) int32 0 1 2 3 4 5 6 7 ... 248 249 250 251 252 253 254 255
  * y              (y) int32 0 1 2 3 4 5 6 7 ... 212 213 214 215 216 217 218 219
    lat            (y, x) float64 dask.array<chunksize=(220, 256), meta=np.ndarray>
    lon            (y, x) float64 dask.array<chunksize=(220, 256), meta=np.ndarray>
  * time           (time) object 1850-01-01 12:00:00 ... 2014-12-31 12:00:00
    time_bounds    (time, bnds) object dask.array<chunksize=(30133, 1), meta=np.ndarray>
    lat_verticies  (y, x, vertex) float64 dask.array<chunksize=(220, 256, 4), meta=np.ndarray>
    lon_verticies  (y, x, vertex) float64 dask.array<chunksize=(220, 256, 4), meta=np.ndarray>
  * bnds           (bnds) int64 0 1
  * vertex         (vertex) int64 0 1 2 3
    lon_bounds     (bnds, y, x) float64 dask.array<chunksize=(1, 220, 256), meta=np.ndarray>
    lat_bounds     (bnds, y, x) float64 dask.array<chunksize=(1, 220, 256), meta=np.ndarray>
    areacello      (y, x) float32 dask.array<chunksize=(220, 256), meta=np.ndarray>
Data variables:
    tos            (time, y, x) float32 dask.array<chunksize=(429, 220, 256), meta=np.ndarray>
Attributes:
    Conventions:             CF-1.7 CMIP-6.2
    activity_id:             CMIP
    branch_method:           standard
    branch_time_in_child:    0.0
    branch_time_in_parent:   0.0
    cmor_version:            3.5.0
    contact:                 cmip6-mpi-esm@dkrz.de
    creation_date:           2019-09-11T14:21:40Z
    data_specs_version:      01.00.30
    experiment:              all-forcing simulation of the recent past
    experiment_id:           historical
    external_variables:      areacello
    forcing_index:           1
    frequency:               day
    further_info_url:        https://furtherinfo.es-doc.org/CMIP6.MPI-M.MPI-E...
    grid:                    gn
    grid_label:              gn
    history:                 2019-09-11T14:21:40Z ; CMOR rewrote data to be c...
    initialization_index:    1
    institution:             Max Planck Institute for Meteorology, Hamburg 20...
    institution_id:          MPI-M
    license:                 CMIP6 model data produced by MPI-M is licensed u...
    mip_era:                 CMIP6
    nominal_resolution:      250 km
    parent_activity_id:      CMIP
    parent_experiment_id:    piControl
    parent_mip_era:          CMIP6
    parent_source_id:        MPI-ESM1-2-LR
    parent_time_units:       days since 1850-1-1 00:00:00
    parent_variant_label:    r1i1p1f1
    physics_index:           1
    product:                 model-output
    project_id:              CMIP6
    realization_index:       1
    realm:                   ocean
    references:              MPI-ESM: Mauritsen, T. et al. (2019), Developmen...
    source:                  MPI-ESM1.2-LR (2017): \naerosol: none, prescribe...
    source_id:               MPI-ESM1-2-LR
    source_type:             AOGCM
    status:                  2020-09-15;created; by gcs.cmip6.ldeo@gmail.com
    sub_experiment:          none
    sub_experiment_id:       none
    table_id:                Oday
    table_info:              Creation Date:(09 May 2019) MD5:e6ef8ececc8f3386...
    title:                   MPI-ESM1-2-LR output prepared for CMIP6
    tracking_id:             hdl:21.14100/634f31a7-9841-4bd4-8a87-fcd03ea56e4...
    variable_id:             tos
    variant_label:           r1i1p1f1
    netcdf_tracking_ids:     hdl:21.14100/634f31a7-9841-4bd4-8a87-fcd03ea56e4...
    version_id:              v20190710
    intake_esm_varname:      None
    intake_esm_dataset_key:  CMIP.MPI-M.MPI-ESM1-2-LR.historical.r1i1p1f1.Oda...

So let's put this to use. We can now plot a properly area-weighted global mean sea surface temperature in a few lines.

[15]:
fig, ax = plt.subplots()
for ds in ddict_matched.values():
    # calculate the weighted average over the surface level temperatures
    area = ds.areacello.fillna(0)
    da = ds.tos.isel(time=slice(0,240)).weighted(area).mean(['x','y']).squeeze().load()
    da.plot(ax=ax, label=ds.attrs['variant_label'])
ax.legend(bbox_to_anchor=(1, 1), loc='upper left')
[15]:
<matplotlib.legend.Legend at 0x1701643d0>
_images/postprocessing_40_1.png

Match multiple metrics

That was not too bad. Let's continue with an example that needs multiple metrics. For this we will load the full-depth potential temperature (thetao), the cell area (areacello), and the vertical cell thickness (thkcello).

[18]:
cat_data = col.search(source_id=source_id, experiment_id=experiment_id, variable_id='thetao')
ddict = cat_data.to_dataset_dict(**kwargs)

cat_metric = col.search(source_id=source_id, variable_id=['areacello', 'thkcello'], experiment_id='historical')
ddict_metrics = cat_metric.to_dataset_dict(**kwargs)

# Matching


--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.zstore.dcpp_init_year.version'
100.00% [10/10 00:00<00:00]

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.zstore.dcpp_init_year.version'
100.00% [30/30 00:02<00:00]

Matching both metrics can be done in a single call:

[19]:
ddict_matched_again = match_metrics(ddict, ddict_metrics, ['areacello', 'thkcello'])
ddict_matched_again['CMIP.MPI-M.MPI-ESM1-2-LR.historical.r2i1p1f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r2i1p1f1/Omon/thetao/gn/v20190710/.nan.20190710']
[19]:
<xarray.Dataset>
Dimensions:        (bnds: 2, lev: 40, time: 1980, vertex: 4, x: 256, y: 220)
Coordinates:
  * x              (x) int32 0 1 2 3 4 5 6 7 ... 248 249 250 251 252 253 254 255
  * y              (y) int32 0 1 2 3 4 5 6 7 ... 212 213 214 215 216 217 218 219
    lat            (y, x) float64 dask.array<chunksize=(220, 256), meta=np.ndarray>
  * lev            (lev) float64 6.0 17.0 27.0 ... 4.67e+03 5.17e+03 5.72e+03
    lev_bounds     (lev, bnds) float64 dask.array<chunksize=(40, 2), meta=np.ndarray>
    lon            (y, x) float64 dask.array<chunksize=(220, 256), meta=np.ndarray>
  * time           (time) object 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
    time_bounds    (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
    lat_verticies  (y, x, vertex) float64 dask.array<chunksize=(220, 256, 4), meta=np.ndarray>
    lon_verticies  (y, x, vertex) float64 dask.array<chunksize=(220, 256, 4), meta=np.ndarray>
  * bnds           (bnds) int64 0 1
  * vertex         (vertex) int64 0 1 2 3
    lon_bounds     (bnds, y, x) float64 dask.array<chunksize=(1, 220, 256), meta=np.ndarray>
    lat_bounds     (bnds, y, x) float64 dask.array<chunksize=(1, 220, 256), meta=np.ndarray>
    areacello      (y, x) float32 dask.array<chunksize=(220, 256), meta=np.ndarray>
    thkcello       (time, lev, y, x) float32 dask.array<chunksize=(66, 40, 220, 256), meta=np.ndarray>
Data variables:
    thetao         (time, lev, y, x) float32 dask.array<chunksize=(13, 40, 220, 256), meta=np.ndarray>
Attributes:
    Conventions:             CF-1.7 CMIP-6.2
    activity_id:             CMIP
    branch_method:           standard
    branch_time_in_child:    0.0
    branch_time_in_parent:   18262.0
    cmor_version:            3.5.0
    contact:                 cmip6-mpi-esm@dkrz.de
    creation_date:           2019-09-02T13:08:50Z
    data_specs_version:      01.00.30
    experiment:              all-forcing simulation of the recent past
    experiment_id:           historical
    external_variables:      areacello volcello
    forcing_index:           1
    frequency:               mon
    further_info_url:        https://furtherinfo.es-doc.org/CMIP6.MPI-M.MPI-E...
    grid:                    gn
    grid_label:              gn
    history:                 2019-09-02T13:08:50Z ; CMOR rewrote data to be c...
    initialization_index:    1
    institution:             Max Planck Institute for Meteorology, Hamburg 20...
    institution_id:          MPI-M
    license:                 CMIP6 model data produced by MPI-M is licensed u...
    mip_era:                 CMIP6
    nominal_resolution:      250 km
    parent_activity_id:      CMIP
    parent_experiment_id:    piControl
    parent_mip_era:          CMIP6
    parent_source_id:        MPI-ESM1-2-LR
    parent_time_units:       days since 1850-1-1 00:00:00
    parent_variant_label:    r1i1p1f1
    physics_index:           1
    product:                 model-output
    project_id:              CMIP6
    realization_index:       2
    realm:                   ocean
    references:              MPI-ESM: Mauritsen, T. et al. (2019), Developmen...
    source:                  MPI-ESM1.2-LR (2017): \naerosol: none, prescribe...
    source_id:               MPI-ESM1-2-LR
    source_type:             AOGCM
    status:                  2020-07-12;created; by gcs.cmip6.ldeo@gmail.com
    sub_experiment:          none
    sub_experiment_id:       none
    table_id:                Omon
    table_info:              Creation Date:(09 May 2019) MD5:e6ef8ececc8f3386...
    title:                   MPI-ESM1-2-LR output prepared for CMIP6
    tracking_id:             hdl:21.14100/9a791e46-3b24-49f0-a628-66ef71241d7...
    variable_id:             thetao
    variant_label:           r2i1p1f1
    netcdf_tracking_ids:     hdl:21.14100/9a791e46-3b24-49f0-a628-66ef71241d7...
    version_id:              v20190710
    intake_esm_varname:      None
    intake_esm_dataset_key:  CMIP.MPI-M.MPI-ESM1-2-LR.historical.r2i1p1f1.Omo...

In order to average temperature over both depth and the whole globe, we need to weight it by cell volume, which is simply the product of area and vertical thickness.

[20]:
fig, ax = plt.subplots()
for i, ds in enumerate(ddict_matched_again.values()):
    # calculate the volume weighted mean ocean temperature
    vol = (ds.areacello * ds.thkcello)
    da = ds.thetao.isel(time=slice(-240, None)).weighted(vol.fillna(0)).mean(['x','y', 'lev']).squeeze().load()
    da.plot(ax=ax, color=f'C{i}', label=ds.attrs['variant_label'])
ax.legend(bbox_to_anchor=(1, 1), loc='upper left')
[20]:
<matplotlib.legend.Legend at 0x1790e5880>
_images/postprocessing_46_1.png

Matching metrics when the exact data is not available

Sometimes, for whatever reason, the exact metric data for a certain field might not be available. But in the case of static data, like cell area, we can assume that as long as the model (source_id) and the grid configuration (grid_label) are the same, we can use the cell area for all members.

Always check if these assumptions work in your case!
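One cheap sanity check, sketched here with made-up shapes (this helper is not part of xmip), is to confirm that a borrowed metric lives on the same horizontal grid as the data before reusing it:

```python
def grid_shapes_match(data_shape, metric_shape):
    """Return True if the trailing (y, x) dimensions agree.

    `data_shape` may carry extra leading dims (time, lev); the metric is
    assumed to be a purely horizontal (y, x) field like areacello.
    """
    return data_shape[-2:] == metric_shape[-2:]

# thetao: (time, lev, y, x); areacello borrowed from another member: (y, x)
print(grid_shapes_match((1980, 30, 218, 360), (218, 360)))  # True
print(grid_shapes_match((1980, 30, 218, 360), (220, 256)))  # False
```

A matching shape does not guarantee identical grids, but a mismatch is a sure sign the metric should not be reused.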

Within the Pangeo CMIP6 catalog there is one example of such a case: the FGOALS-f3-L model.

[21]:
cat_data = col.search(source_id='FGOALS-f3-L', variable_id='thetao', experiment_id=experiment_id, grid_label='gn', table_id='Omon')
ddict = cat_data.to_dataset_dict(**kwargs)
cat_metric = col.search(source_id='FGOALS-f3-L', variable_id='areacello', experiment_id='historical', grid_label='gn')
ddict_metrics = cat_metric.to_dataset_dict(**kwargs)

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.zstore.dcpp_init_year.version'
/Users/juliusbusecke/code/xmip/xmip/preprocessing.py:267: UserWarning: `FGOALS-f3-L` does not provide lon or lat bounds.
  warnings.warn("`FGOALS-f3-L` does not provide lon or lat bounds.")
100.00% [3/3 00:00<00:00]
/Users/juliusbusecke/code/xmip/xmip/preprocessing.py:267: UserWarning: `FGOALS-f3-L` does not provide lon or lat bounds.
  warnings.warn("`FGOALS-f3-L` does not provide lon or lat bounds.")
/Users/juliusbusecke/code/xmip/xmip/preprocessing.py:267: UserWarning: `FGOALS-f3-L` does not provide lon or lat bounds.
  warnings.warn("`FGOALS-f3-L` does not provide lon or lat bounds.")

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.zstore.dcpp_init_year.version'
/Users/juliusbusecke/code/xmip/xmip/preprocessing.py:267: UserWarning: `FGOALS-f3-L` does not provide lon or lat bounds.
  warnings.warn("`FGOALS-f3-L` does not provide lon or lat bounds.")
100.00% [1/1 00:00<00:00]
[22]:
list(ddict.keys())
[22]:
['CMIP.CAS.FGOALS-f3-L.historical.r2i1p1f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/CAS/FGOALS-f3-L/historical/r2i1p1f1/Omon/thetao/gn/v20191008/.nan.20191008',
 'CMIP.CAS.FGOALS-f3-L.historical.r1i1p1f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/CAS/FGOALS-f3-L/historical/r1i1p1f1/Omon/thetao/gn/v20190822/.nan.20190822',
 'CMIP.CAS.FGOALS-f3-L.historical.r3i1p1f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/CAS/FGOALS-f3-L/historical/r3i1p1f1/Omon/thetao/gn/v20191008/.nan.20191008']
[23]:
list(ddict_metrics.keys())
[23]:
['CMIP.CAS.FGOALS-f3-L.historical.r1i1p1f1.Ofx.areacello.gn.gs://cmip6/CMIP6/CMIP/CAS/FGOALS-f3-L/historical/r1i1p1f1/Ofx/areacello/gn/v20190918/.nan.20190918']

We can see that there are three members and only one of them has output for the cell area.

By default match_metrics will use any available metric data that has the same source_id and grid_label attributes.
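Conceptually, the matching works roughly like this (a simplified sketch with plain dictionaries, not the actual match_metrics implementation):

```python
def find_metric(data_attrs, metric_datasets, match_attrs=('source_id', 'grid_label')):
    """Return the key of the first metric whose attributes agree with the
    data on every attribute in `match_attrs`, or None if nothing matches."""
    for key, attrs in metric_datasets.items():
        if all(attrs.get(a) == data_attrs.get(a) for a in match_attrs):
            return key
    return None

metrics = {
    'r1i1p1f1.areacello': {'source_id': 'FGOALS-f3-L', 'grid_label': 'gn',
                           'variant_label': 'r1i1p1f1'},
}
data = {'source_id': 'FGOALS-f3-L', 'grid_label': 'gn', 'variant_label': 'r2i1p1f1'}

# r2i1p1f1 has no areacello of its own, but the r1i1p1f1 one agrees on
# source_id and grid_label, so it gets reused:
print(find_metric(data, metrics))  # 'r1i1p1f1.areacello'

# with variant_label added to match_attrs, nothing matches anymore:
print(find_metric(data, metrics, ('source_id', 'grid_label', 'variant_label')))  # None
```

The second call previews the stricter matching demonstrated further below.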

[24]:
ddict_matched = match_metrics(ddict, ddict_metrics, ['areacello'])
[25]:
ds = ddict_matched['CMIP.CAS.FGOALS-f3-L.historical.r2i1p1f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/CAS/FGOALS-f3-L/historical/r2i1p1f1/Omon/thetao/gn/v20191008/.nan.20191008']
ds
[25]:
<xarray.Dataset>
Dimensions:      (bnds: 2, lev: 30, time: 1980, x: 360, y: 218)
Coordinates:
  * x            (x) int32 0 1 2 3 4 5 6 7 8 ... 352 353 354 355 356 357 358 359
  * y            (y) int32 0 1 2 3 4 5 6 7 8 ... 210 211 212 213 214 215 216 217
    lat          (y, x) float64 dask.array<chunksize=(218, 360), meta=np.ndarray>
  * lev          (lev) float64 5.0 15.0 25.0 ... 3.856e+03 4.538e+03 5.244e+03
    lev_bounds   (lev, bnds) float64 dask.array<chunksize=(30, 2), meta=np.ndarray>
    lon          (y, x) float64 dask.array<chunksize=(218, 360), meta=np.ndarray>
  * time         (time) object 1850-01-16 13:00:00.000007 ... 2014-12-16 12:0...
    time_bounds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * bnds         (bnds) int64 0 1
    areacello    (y, x) float32 dask.array<chunksize=(218, 360), meta=np.ndarray>
Data variables:
    thetao       (time, lev, y, x) float32 dask.array<chunksize=(5, 30, 218, 360), meta=np.ndarray>
Attributes:
    Conventions:             CF-1.7 CMIP-6.2
    NCO:                     netCDF Operators version 4.8.1 (Homepage = http:...
    activity_id:             CMIP
    branch_method:           Spin-up documentation
    branch_time_in_child:    674885.0
    branch_time_in_parent:   236885.0
    cmor_version:            3.4.0
    contact:                 Yongqiang Yu(yyq@lasg.iap.ac.cn)
    creation_date:           2019-10-07T20:29:01Z
    data_specs_version:      01.00.30
    experiment:              all-forcing simulation of the recent past
    experiment_id:           historical
    external_variables:      areacello volcello
    forcing_index:           1
    frequency:               mon
    further_info_url:        https://furtherinfo.es-doc.org/CMIP6.CAS.FGOALS-...
    grid:                    gs1x1
    grid_label:              gn
    history:                 Thu Nov 14 10:13:24 2019: ncatted -O -a license,...
    initialization_index:    1
    institution:             Chinese Academy of Sciences, Beijing 100029, China
    institution_id:          CAS
    license:                 CMIP6 model data produced by LASG, Institute of ...
    mip_era:                 CMIP6
    nominal_resolution:      100 km
    parent_activity_id:      CMIP
    parent_experiment_id:    piControl
    parent_mip_era:          CMIP6
    parent_source_id:        FGOALS-f3-L
    parent_time_units:       days since 0001-01-01
    parent_variant_label:    r1i1p1f1
    physics_index:           1
    product:                 model-output
    realization_index:       2
    realm:                   ocean
    references:              Model described by Yu et al.2019,
    run_variant:             forcing: black carbon aerosol only
    source:                  FGOALS-f3-L (2017): \naerosol: none\natmos: FAMI...
    source_id:               FGOALS-f3-L
    source_type:             AOGCM
    status:                  2020-12-20;created; by gcs.cmip6.ldeo@gmail.com
    sub_experiment:          none
    sub_experiment_id:       none
    table_id:                Omon
    table_info:              Creation Date:(09 May 2019) MD5:cde930676e68ac67...
    title:                   FGOALS-f3-L output prepared for CMIP6
    tracking_id:             hdl:21.14100/425a2350-df4c-4657-9b9d-5717e8cd983...
    variable_id:             thetao
    variant_label:           r2i1p1f1
    netcdf_tracking_ids:     hdl:21.14100/425a2350-df4c-4657-9b9d-5717e8cd983...
    version_id:              v20191008
    intake_esm_varname:      None
    intake_esm_dataset_key:  CMIP.CAS.FGOALS-f3-L.historical.r2i1p1f1.Omon.th...

You can see that even the other members now have an areacello coordinate. The attributes of the DataArray give some hints about where this data came from:

[26]:
ds.areacello.attrs
[26]:
{'cell_methods': 'area: sum',
 'comment': 'Horizontal area of ocean grid cells',
 'long_name': 'Grid-Cell Area for Ocean Variables',
 'standard_name': 'cell_area',
 'units': 'm2',
 'original_key': 'CMIP.CAS.FGOALS-f3-L.historical.r1i1p1f1.Ofx.areacello.gn.gs://cmip6/CMIP6/CMIP/CAS/FGOALS-f3-L/historical/r1i1p1f1/Ofx/areacello/gn/v20190918/.nan.20190918',
 'parsed_with': 'xmip/postprocessing/parse_metric'}

The original_key field preserves the dictionary key of ddict_metrics, which in this case has all the information to find the actual source file.

If you are using this without intake-esm your keys might be different and this output might be less useful.
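Since intake-esm keys are dot-joined in the fixed order shown earlier ('activity_id.institution_id.source_id. ...'), the leading attributes can be recovered by position. A quick sketch (only safe for the leading fields, since the zstore URL near the end contains dots itself):

```python
key = ('CMIP.CAS.FGOALS-f3-L.historical.r1i1p1f1.Ofx.areacello.gn.'
       'gs://cmip6/CMIP6/CMIP/CAS/FGOALS-f3-L/historical/r1i1p1f1/Ofx/areacello/gn/v20190918/.nan.20190918')

parts = key.split('.')
print(parts[2])  # source_id -> 'FGOALS-f3-L'
print(parts[4])  # member_id -> 'r1i1p1f1'
```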

You have fine-grained control over which dataset attributes are used for matching datasets to metrics.

For example, if you add variant_label to the list of matched attributes, metrics will only be parsed for dataset/metric pairs that have the exact same variant_label.

[27]:
ddict_matched_strict = match_metrics(ddict, ddict_metrics, ['areacello'], match_attrs=['source_id', 'grid_label', 'variant_label'])

ds_strict = ddict_matched_strict['CMIP.CAS.FGOALS-f3-L.historical.r2i1p1f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/CAS/FGOALS-f3-L/historical/r2i1p1f1/Omon/thetao/gn/v20191008/.nan.20191008']
ds_strict
/Users/juliusbusecke/code/xmip/xmip/postprocessing.py:126: UserWarning: No matching metrics found for areacello
  warnings.warn(f"No matching metrics found for {mv}")
[27]:
<xarray.Dataset>
Dimensions:      (bnds: 2, lev: 30, time: 1980, x: 360, y: 218)
Coordinates:
  * x            (x) int32 0 1 2 3 4 5 6 7 8 ... 352 353 354 355 356 357 358 359
  * y            (y) int32 0 1 2 3 4 5 6 7 8 ... 210 211 212 213 214 215 216 217
    lat          (y, x) float64 dask.array<chunksize=(218, 360), meta=np.ndarray>
  * lev          (lev) float64 5.0 15.0 25.0 ... 3.856e+03 4.538e+03 5.244e+03
    lev_bounds   (lev, bnds) float64 dask.array<chunksize=(30, 2), meta=np.ndarray>
    lon          (y, x) float64 dask.array<chunksize=(218, 360), meta=np.ndarray>
  * time         (time) object 1850-01-16 13:00:00.000007 ... 2014-12-16 12:0...
    time_bounds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * bnds         (bnds) int64 0 1
Data variables:
    thetao       (time, lev, y, x) float32 dask.array<chunksize=(5, 30, 218, 360), meta=np.ndarray>
Attributes:
    Conventions:             CF-1.7 CMIP-6.2
    NCO:                     netCDF Operators version 4.8.1 (Homepage = http:...
    activity_id:             CMIP
    branch_method:           Spin-up documentation
    branch_time_in_child:    674885.0
    branch_time_in_parent:   236885.0
    cmor_version:            3.4.0
    contact:                 Yongqiang Yu(yyq@lasg.iap.ac.cn)
    creation_date:           2019-10-07T20:29:01Z
    data_specs_version:      01.00.30
    experiment:              all-forcing simulation of the recent past
    experiment_id:           historical
    external_variables:      areacello volcello
    forcing_index:           1
    frequency:               mon
    further_info_url:        https://furtherinfo.es-doc.org/CMIP6.CAS.FGOALS-...
    grid:                    gs1x1
    grid_label:              gn
    history:                 Thu Nov 14 10:13:24 2019: ncatted -O -a license,...
    initialization_index:    1
    institution:             Chinese Academy of Sciences, Beijing 100029, China
    institution_id:          CAS
    license:                 CMIP6 model data produced by LASG, Institute of ...
    mip_era:                 CMIP6
    nominal_resolution:      100 km
    parent_activity_id:      CMIP
    parent_experiment_id:    piControl
    parent_mip_era:          CMIP6
    parent_source_id:        FGOALS-f3-L
    parent_time_units:       days since 0001-01-01
    parent_variant_label:    r1i1p1f1
    physics_index:           1
    product:                 model-output
    realization_index:       2
    realm:                   ocean
    references:              Model described by Yu et al.2019,
    run_variant:             forcing: black carbon aerosol only
    source:                  FGOALS-f3-L (2017): \naerosol: none\natmos: FAMI...
    source_id:               FGOALS-f3-L
    source_type:             AOGCM
    status:                  2020-12-20;created; by gcs.cmip6.ldeo@gmail.com
    sub_experiment:          none
    sub_experiment_id:       none
    table_id:                Omon
    table_info:              Creation Date:(09 May 2019) MD5:cde930676e68ac67...
    title:                   FGOALS-f3-L output prepared for CMIP6
    tracking_id:             hdl:21.14100/425a2350-df4c-4657-9b9d-5717e8cd983...
    variable_id:             thetao
    variant_label:           r2i1p1f1
    netcdf_tracking_ids:     hdl:21.14100/425a2350-df4c-4657-9b9d-5717e8cd983...
    version_id:              v20191008
    intake_esm_varname:      None
    intake_esm_dataset_key:  CMIP.CAS.FGOALS-f3-L.historical.r2i1p1f1.Omon.th...

The one member that has an exact matched metric will still get it.

[28]:
ds_strict_matched = ddict_matched_strict['CMIP.CAS.FGOALS-f3-L.historical.r1i1p1f1.Omon.thetao.gn.gs://cmip6/CMIP6/CMIP/CAS/FGOALS-f3-L/historical/r1i1p1f1/Omon/thetao/gn/v20190822/.nan.20190822']
ds_strict_matched
[28]:
<xarray.Dataset>
Dimensions:      (bnds: 2, lev: 30, time: 1980, x: 360, y: 218)
Coordinates:
  * x            (x) int32 0 1 2 3 4 5 6 7 8 ... 352 353 354 355 356 357 358 359
  * y            (y) int32 0 1 2 3 4 5 6 7 8 ... 210 211 212 213 214 215 216 217
    lat          (y, x) float64 dask.array<chunksize=(218, 360), meta=np.ndarray>
  * lev          (lev) float64 5.0 15.0 25.0 ... 3.856e+03 4.538e+03 5.244e+03
    lev_bounds   (lev, bnds) float64 dask.array<chunksize=(30, 2), meta=np.ndarray>
    lon          (y, x) float64 dask.array<chunksize=(218, 360), meta=np.ndarray>
  * time         (time) object 1850-01-16 13:00:00 ... 2014-12-16 12:00:00
    time_bounds  (time, bnds) object dask.array<chunksize=(1980, 2), meta=np.ndarray>
  * bnds         (bnds) int64 0 1
    areacello    (y, x) float32 dask.array<chunksize=(218, 360), meta=np.ndarray>
Data variables:
    thetao       (time, lev, y, x) float32 dask.array<chunksize=(10, 30, 218, 360), meta=np.ndarray>
Attributes:
    Conventions:             CF-1.7 CMIP-6.2
    activity_id:             CMIP
    branch_method:           Spin-up documentation
    branch_time_in_child:    674885.0
    branch_time_in_parent:   218635.0
    cmor_version:            3.4.0
    contact:                 Yongqiang Yu(yyq@lasg.iap.ac.cn)
    creation_date:           2019-08-22T15:40:13Z
    data_specs_version:      01.00.30
    experiment:              all-forcing simulation of the recent past
    experiment_id:           historical
    external_variables:      areacello volcello
    forcing_index:           1
    frequency:               mon
    further_info_url:        https://furtherinfo.es-doc.org/CMIP6.CAS.FGOALS-...
    grid:                    gs1x1
    grid_label:              gn
    history:                 2019-08-22T13:32:56Z ;rewrote data to be consist...
    initialization_index:    1
    institution:             Chinese Academy of Sciences, Beijing 100029, China
    institution_id:          CAS
    license:                 CMIP6 model data produced by Lawrence Livermore ...
    mip_era:                 CMIP6
    nominal_resolution:      100 km
    parent_activity_id:      CMIP
    parent_experiment_id:    piControl
    parent_mip_era:          CMIP6
    parent_source_id:        FGOALS-f3-L
    parent_time_units:       days since 0001-01-01
    parent_variant_label:    r1i1p1f1
    physics_index:           1
    product:                 model-output
    realization_index:       1
    realm:                   ocean
    references:              Model described by Yu et al.2019,
    run_variant:             forcing: black carbon aerosol only
    source:                  FGOALS-f3-L (2017): \naerosol: none\natmos: FAMI...
    source_id:               FGOALS-f3-L
    source_type:             AOGCM
    sub_experiment:          none
    sub_experiment_id:       none
    table_id:                Omon
    table_info:              Creation Date:(09 May 2019) MD5:cde930676e68ac67...
    title:                   FGOALS-f3-L output prepared for CMIP6
    tracking_id:             hdl:21.14100/1cddc583-15f9-4315-b09b-1bc2cef6198...
    variable_id:             thetao
    variant_label:           r1i1p1f1
    status:                  2019-10-25;created;by nhn2@columbia.edu
    netcdf_tracking_ids:     hdl:21.14100/1cddc583-15f9-4315-b09b-1bc2cef6198...
    version_id:              v20190822
    intake_esm_varname:      None
    intake_esm_dataset_key:  CMIP.CAS.FGOALS-f3-L.historical.r1i1p1f1.Omon.th...

You can see that the r2i1p1f1 member was not matched, while r1i1p1f1 was.

Putting it all together

OK, it is time for a more high-level example!

Let's look at the global mean sea surface temperature for ALL available models, not just one. The process is pretty much exactly the same; you only have to modify your initial .search() query for both the data and the metrics.

I temporarily had to exclude a single member due to a bug with intake-esm (see here for details).

[29]:
experiment_id = 'ssp585'
cat_data = col.search(variable_id='tos', experiment_id=experiment_id, grid_label='gn', table_id='Omon')

##### remove a single store
# see https://github.com/intake/intake-esm/issues/246 for details on how to modify the dataframe
df = cat_data.df
drop_idx = cat_data.df.index[cat_data.df['zstore'].str.contains('ScenarioMIP.CCCma.CanESM5.ssp585.r9i1p1f1.Omon.tos.gn')]
df = df.drop(drop_idx)
cat_data = cat_data.from_df(df=df, esmcol_data=cat_data.esmcol_data)
#####

ddict = cat_data.to_dataset_dict(**kwargs)
cat_metric = col.search(variable_id='areacello', experiment_id=experiment_id, grid_label='gn')
ddict_metrics = cat_metric.to_dataset_dict(**kwargs)
ddict_matched = match_metrics(ddict, ddict_metrics, ['areacello'], print_statistics=True)

# remove the datasets where the parsing was unsuccesful
ddict_matched_filtered = {k:ds for k,ds in ddict_matched.items() if 'areacello' in ds.variables}

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.zstore.dcpp_init_year.version'
100.00% [272/272 00:33<00:00]
/Users/juliusbusecke/code/xmip/xmip/preprocessing.py:267: UserWarning: `FGOALS-f3-L` does not provide lon or lat bounds.
  warnings.warn("`FGOALS-f3-L` does not provide lon or lat bounds.")
/Users/juliusbusecke/code/xmip/xmip/preprocessing.py:267: UserWarning: `FGOALS-f3-L` does not provide lon or lat bounds.
  warnings.warn("`FGOALS-f3-L` does not provide lon or lat bounds.")
/Users/juliusbusecke/code/xmip/xmip/preprocessing.py:267: UserWarning: `FGOALS-f3-L` does not provide lon or lat bounds.
  warnings.warn("`FGOALS-f3-L` does not provide lon or lat bounds.")

--> The keys in the returned dictionary of datasets are constructed as follows:
        'activity_id.institution_id.source_id.experiment_id.member_id.table_id.variable_id.grid_label.zstore.dcpp_init_year.version'
100.00% [184/184 00:11<00:00]
Processed 272 datasets.
Exact matches:{'areacello': 0}
Other matches:{'areacello': 239}
No match found:{'areacello': 33}
/Users/juliusbusecke/code/xmip/xmip/postprocessing.py:126: UserWarning: No matching metrics found for areacello
  warnings.warn(f"No matching metrics found for {mv}")

We started with 272 datasets but only 184 cell-area outputs, and were still able to match 239 of the datasets with a metric! The 33 datasets without a metric need to be removed before plotting.

Executing this plot takes a while. As always, ☕️ time

[42]:
models = np.sort(cat_metric.df['source_id'].unique())
fig, axarr = plt.subplots(ncols=6, nrows=5, figsize=[16, 8], sharex=True, sharey=True)
for model, ax in zip(models, axarr.flat):
    ddict_model = {k:ds for k,ds in ddict_matched_filtered.items() if model in k}
    for ds in ddict_model.values():
        # calculate the area-weighted mean sea surface temperature
        # (tos is a 2D surface field, so average over the horizontal dims only)
        da = ds.tos.sel(time=slice('2000', '2100')).weighted(ds.areacello.fillna(0)).mean(['x','y']).squeeze().load()
        # resample to 3-month averages
        da = da.resample(time='3MS').mean()
        da.plot(ax=ax, color='C1', label=ds.attrs['variant_label'], alpha=0.5)
    ax.text(0.03,0.97,model,ha='left',va='top', transform=ax.transAxes)
    ax.set_xlabel('')
    ax.grid()
fig.subplots_adjust(hspace=0, wspace=0)
_images/postprocessing_64_0.png

There are some interesting differences between models, both in the mean temperature at the end of the century and in the variability.

I hope that if you are trying to work with many different models and grids, this makes your life a little bit easier.

https://media.giphy.com/media/xUA7aNKumDkMaGDgn6/giphy.gif