OceanTACO Data Retrieval Workflows

This notebook is a self-contained guide to the OceanTACO retrieval APIs, the step that comes before ML batching.

It covers:

  • Catalog loading and file discovery

  • Direct file fetch internals

  • Tile and bbox retrieval

  • SWOT-specific merge behavior

  • Multi-source time series loading

  • Edge-case handling and HuggingFace streaming

1. Install Profiles

# Verify ocean_taco is installed.
# Uncomment one line to install inside this notebook:
# %pip install "ocean_taco[hf] @ git+https://github.com/nilsleh/oceanTACO.git@main"
# %pip install ocean-taco[hf]
import ocean_taco
print(f"ocean_taco {ocean_taco.__version__} ready")
ocean_taco 0.1.0 ready
from pathlib import Path
from IPython.display import display

from ocean_taco.dataset.retrieve import (
    HF_DEFAULT_URL,
    _iter_dates,
    _normalize_source_filename,
    _source_token,
    _tile_bbox,
    fetch_nc,
    load_bbox_nc,
    load_bbox_swot_nc,
    load_hf_dataset,
    load_multisource_time_series_nc,
    load_region_product_nc,
    load_tile_nc,
    query_files,
)
# -- Configuration --
HF_URL = HF_DEFAULT_URL
CACHE_DIR = "./cache"

# Keep defaults small for fast tutorial execution.
DATE = "2024-06-01"
DATE_START = "2024-06-01"
DATE_END = "2024-06-03"

TILE = "NORTH_ATLANTIC"
BBOX = (-95.0, -5.0, -85.0, 5.0)

DATA_SOURCE = "l4_sst"
MULTI_SOURCES = ["l4_ssh", "l4_sst", "glorys"]

Path(CACHE_DIR).mkdir(parents=True, exist_ok=True)
print("Configuration ready")
Configuration ready

2. Retrieval API Map

Main retrieval functions in ocean_taco.dataset.retrieve:

  • load_hf_dataset(url): open catalog metadata from HuggingFace

  • query_files(dataset, date, bbox, needed_files): filter catalog records

  • fetch_nc(row): download the NetCDF file for one catalog row; returns (filename, xarray.Dataset)

  • load_tile_nc(...): one tile, one source, one date

  • load_region_product_nc(...): alias of load_tile_nc that takes region in place of tile

  • load_bbox_nc(...): all overlapping tiles for bbox + merge

  • load_bbox_swot_nc(...): SWOT-specific bbox load with non-spatial dim cleanup

  • load_multisource_time_series_nc(...): load several sources over an inclusive date range

catalog = load_hf_dataset(HF_URL)
print("Catalog loaded:", type(catalog))
Catalog loaded: <class 'tacoreader.dataset.TacoDataset'>

3. File Discovery Internals (query_files)

query_files chains:

  1. filter_datetime(f"{date}/{date}")

  2. filter_bbox(*bbox, level=1)

  3. flatten()

  4. filename filter via l2:id

Bounding box order is (lon_min, lat_min, lon_max, lat_max).

resolved_name = _normalize_source_filename(DATA_SOURCE)
discovered = query_files(catalog, DATE, _tile_bbox(TILE), {resolved_name})

print("Resolved filename:", resolved_name)
print("Rows found:", len(discovered))
if len(discovered) > 0:
    preview_cols = [c for c in ["l2:id", "gdal_vsi", "internal:gdal_vsi", "url", "href"] if c in discovered.columns]
    display(discovered[preview_cols].head(3))
Resolved filename: l4_sst.nc
Rows found: 6
    l2:id      gdal_vsi
7   l4_sst.nc  /vsicurl/https://huggingface.co/datasets/nilsl...
18  l4_sst.nc  /vsicurl/https://huggingface.co/datasets/nilsl...
29  l4_sst.nc  /vsicurl/https://huggingface.co/datasets/nilsl...
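
The bbox ordering matters most when testing overlap between a query box and tile footprints. The sketch below is not the tacoreader implementation of filter_bbox; it is a minimal stand-in that shows the (lon_min, lat_min, lon_max, lat_max) convention in use. The BBox type, the bboxes_overlap helper, and the tile footprints are all hypothetical, and the sketch assumes boxes that touch along an edge count as overlapping and that no box crosses the antimeridian.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Bounding box in (lon_min, lat_min, lon_max, lat_max) order."""

    lon_min: float
    lat_min: float
    lon_max: float
    lat_max: float


def bboxes_overlap(a: BBox, b: BBox) -> bool:
    """Return True when two boxes share area or touch along an edge."""
    return (
        a.lon_min <= b.lon_max
        and b.lon_min <= a.lon_max
        and a.lat_min <= b.lat_max
        and b.lat_min <= a.lat_max
    )


# Same query box as BBOX in the configuration cell.
query = BBox(-95.0, -5.0, -85.0, 5.0)

# Hypothetical tile footprints, invented for illustration only.
tiles = {
    "tile_a": BBox(-100.0, -10.0, -90.0, 0.0),
    "tile_b": BBox(-90.0, 0.0, -80.0, 10.0),
    "tile_c": BBox(10.0, 40.0, 20.0, 50.0),
}

hits = [name for name, box in tiles.items() if bboxes_overlap(query, box)]
print(hits)
```

Swapping longitude and latitude in the tuple is the most common mistake with this convention: the call still runs, but it silently selects the wrong tiles or none at all.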

4. Single Row Fetch Internals (fetch_nc)

fetch_nc scans URL-like columns in this order: internal:gdal_vsi, gdal_vsi, url, href, then reads NetCDF with h5netcdf.

if len(discovered) == 0:
    print("No rows available to demonstrate fetch_nc.")
else:
    fname, ds_one = fetch_nc(discovered.iloc[0])
    print("Fetched file:", fname)
    print("Dataset dims:", dict(ds_one.sizes))
    print("Data variables:", list(ds_one.data_vars)[:8])
Fetched file: l4_sst.nc
Dataset dims: {'lat': 800, 'lon': 900}
Data variables: ['analysed_sst', 'analysis_error', 'mask', 'sea_ice_fraction']
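
The column scan described for fetch_nc can be pictured as a first-match lookup. The following is a sketch of that idea only, not the library's code: first_url is a hypothetical helper, the example URLs are placeholders, and treating empty or non-string values as missing is an assumption.

```python
from typing import Any, Optional

# Priority order as described for fetch_nc in this section.
URL_COLUMNS = ("internal:gdal_vsi", "gdal_vsi", "url", "href")


def first_url(row: Any) -> Optional[str]:
    """Return the first non-empty string among the candidate columns.

    `row` only needs a dict-style .get method, so a plain dict or a
    pandas Series both work. Non-string values (for example NaN) are
    skipped, which is an assumption of this sketch.
    """
    for column in URL_COLUMNS:
        value = row.get(column)
        if isinstance(value, str) and value.strip():
            return value
    return None


# Placeholder rows, invented for illustration only.
row_with_vsi = {
    "l2:id": "l4_sst.nc",
    "gdal_vsi": "/vsicurl/https://example.org/l4_sst.nc",
    "url": "https://example.org/fallback.nc",
}
row_with_href = {"internal:gdal_vsi": "", "href": "https://example.org/a.nc"}

print(first_url(row_with_vsi))
print(first_url(row_with_href))
print(first_url({"l2:id": "l4_sst.nc"}))
```

A fixed priority order keeps the result deterministic when a catalog row carries more than one location for the same file.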

5. One Exact Tile (load_tile_nc)

Use this path for deterministic region-level analysis.

tile_ds = load_tile_nc(
    dataset=catalog,
    date=DATE,
    tile=TILE,
    data_source=DATA_SOURCE,
    cache_dir=CACHE_DIR,
)

if tile_ds is None:
    print("Tile dataset not found for requested inputs.")
else:
    print("Tile dims:", dict(tile_ds.sizes))
    print("Tile vars (sample):", list(tile_ds.data_vars)[:8])
Tile dims: {'lat': 800, 'lon': 900}
Tile vars (sample): ['analysed_sst', 'analysis_error', 'mask', 'sea_ice_fraction']
region_ds = load_region_product_nc(
    dataset=catalog,
    date=DATE,
    region=TILE,
    data_source=DATA_SOURCE,
    cache_dir=CACHE_DIR,
)

print("Alias loader returned None?", region_ds is None)
if tile_ds is not None and region_ds is not None:
    print("Same dims as load_tile_nc?", dict(tile_ds.sizes) == dict(region_ds.sizes))
Alias loader returned None? False
Same dims as load_tile_nc? True

6. Bbox Retrieval and Merge (load_bbox_nc)

load_bbox_nc:

  • finds all tiles overlapping the bbox

  • downloads each tile (or reuses cache)

  • returns one dataset (single tile) or merges many via xr.combine_by_coords

bbox_ds = load_bbox_nc(
    dataset=catalog,
    date=DATE,
    bbox=BBOX,
    data_source=DATA_SOURCE,
    cache_dir=CACHE_DIR,
)

print("BBOX:", BBOX)
if bbox_ds is None:
    print("No bbox dataset found.")
else:
    print("Bbox dims:", dict(bbox_ds.sizes))
    print("Bbox vars (sample):", list(bbox_ds.data_vars)[:8])
BBOX: (-95.0, -5.0, -85.0, 5.0)
Bbox dims: {'lat': 1600, 'lon': 1800}
Bbox vars (sample): ['analysed_sst', 'analysis_error', 'mask', 'sea_ice_fraction']
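
To see what the merge step does in isolation, here is a toy example of xr.combine_by_coords on four synthetic tiles. It does not touch the catalog: make_tile, the coordinate values, and the fill values are made up for illustration, and only the variable name analysed_sst is borrowed from the output above.

```python
import numpy as np
import xarray as xr


def make_tile(lat: np.ndarray, lon: np.ndarray, fill: float) -> xr.Dataset:
    """Build a toy gridded tile with one variable on (lat, lon)."""
    data = np.full((lat.size, lon.size), fill)
    return xr.Dataset(
        {"analysed_sst": (("lat", "lon"), data)},
        coords={"lat": lat, "lon": lon},
    )


# Four adjacent tiles of shape (2, 3) that form a 2x2 mosaic.
south, north = np.array([-1.5, -0.5]), np.array([0.5, 1.5])
west = np.array([-92.5, -91.5, -90.5])
east = np.array([-89.5, -88.5, -87.5])

tiles = [
    make_tile(south, west, 1.0),
    make_tile(south, east, 2.0),
    make_tile(north, west, 3.0),
    make_tile(north, east, 4.0),
]

# combine_by_coords infers the tile layout from the coordinate values,
# so the order of the input list does not matter.
merged = xr.combine_by_coords(tiles)
print(dict(merged.sizes))
```

The merged grid is twice as long as a single tile along each axis. The real output above shows the same pattern (800 x 900 for one tile, 1600 x 1800 for the bbox), which would be consistent with a 2x2 mosaic of equally sized tiles.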

7. SWOT-Specific Bbox Retrieval (load_bbox_swot_nc)

SWOT tiles can contain non-spatial dimensions (for example track) that do not align across tiles. Before merge, the loader drops variables bound to non-spatial dims to preserve mergeable (time, lat, lon) gridded fields.

swot_ds = load_bbox_swot_nc(
    dataset=catalog,
    date=DATE,
    bbox=BBOX,
    cache_dir=CACHE_DIR,
)

if swot_ds is None:
    print("No SWOT data found for this date/bbox.")
else:
    print("SWOT dims:", dict(swot_ds.sizes))
    print("SWOT vars (sample):", list(swot_ds.data_vars)[:10])
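
The cleanup rule described at the top of this section can be sketched without xarray by mapping each variable name to its dimension names. This is an illustration of the selection logic only, not the loader's code: mergeable_variables is a hypothetical helper, and the variable names in swot_like are invented rather than taken from real SWOT files.

```python
# Dimensions that gridded fields are allowed to use before the merge.
SPATIAL_DIMS = {"time", "lat", "lon"}


def mergeable_variables(var_dims: dict[str, tuple[str, ...]]) -> list[str]:
    """Keep variables whose dimensions all fall within SPATIAL_DIMS."""
    return [name for name, dims in var_dims.items() if set(dims) <= SPATIAL_DIMS]


# Hypothetical variable layout; real SWOT tiles may use different names.
swot_like = {
    "ssha": ("time", "lat", "lon"),
    "quality_flag": ("lat", "lon"),
    "track_id": ("track",),
    "track_time": ("track", "time"),
}

kept = mergeable_variables(swot_like)
dropped = sorted(set(swot_like) - set(kept))
print("kept:", kept)
print("dropped:", dropped)
```

With an actual xarray.Dataset, the same mapping could be built from ds[name].dims for each data variable, and the dropped names passed to ds.drop_vars before merging.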

8. Multi-Source Time Series (load_multisource_time_series_nc)

This API loops over dates and sources, then concatenates each source over time.

Contract highlights:

  • exactly one of tile or bbox must be given

  • date range is inclusive

  • each source key is normalized to an extension-free token

  • values can be None if no data is found

stacked = load_multisource_time_series_nc(
    dataset=catalog,
    data_sources=MULTI_SOURCES,
    date_start=DATE_START,
    date_end=DATE_END,
    tile=TILE,
    cache_dir=CACHE_DIR,
)

for source_key, ds in stacked.items():
    if ds is None:
        print(f"{source_key}: None (no data in range)")
        continue
    time_len = int(ds.sizes.get("time", 0))
    print(f"{source_key}: dims={dict(ds.sizes)} time_len={time_len}")
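
Two parts of the contract above are easy to check by hand: the inclusive date range and the rule that exactly one of tile or bbox is given. The sketch below mirrors them with the standard library only. inclusive_dates, require_one_region, and plan_requests are hypothetical names, not functions from ocean_taco, and rejecting an end date earlier than the start date is an assumption of this sketch.

```python
from datetime import date, timedelta
from typing import Optional, Sequence


def inclusive_dates(start: str, end: str) -> list[str]:
    """List ISO dates from start to end, including both endpoints."""
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    if last < first:
        raise ValueError("date_end must not precede date_start")
    n_days = (last - first).days
    return [(first + timedelta(days=i)).isoformat() for i in range(n_days + 1)]


def require_one_region(tile: Optional[str], bbox: Optional[Sequence[float]]) -> str:
    """Enforce that exactly one of tile or bbox is provided."""
    if (tile is None) == (bbox is None):
        raise ValueError("Provide exactly one of tile or bbox")
    return "tile" if tile is not None else "bbox"


def plan_requests(sources, start, end, tile=None, bbox=None):
    """Map each source to the dates a source x date loop would visit."""
    require_one_region(tile, bbox)
    return {source: inclusive_dates(start, end) for source in sources}


plan = plan_requests(
    ["l4_ssh", "l4_sst", "glorys"],
    "2024-06-01",
    "2024-06-03",
    tile="NORTH_ATLANTIC",
)
for source, days in plan.items():
    print(source, days)
```

With the tutorial defaults this gives three dates per source, so a fully populated result would have a time length of 3 for each key.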

9. Edge Cases and Helper Semantics

These helpers are useful to understand why API outputs look the way they do.

print("Normalize token -> filename:", _normalize_source_filename("l4_sst"))
print("Normalize filename -> filename:", _normalize_source_filename("l4_sst.nc"))
print("Source token from filename:", _source_token("l4_sst.nc"))
print("Inclusive date iteration:", list(_iter_dates("2024-06-01", "2024-06-03")))

try:
    _ = load_multisource_time_series_nc(
        dataset=catalog,
        data_sources=["l4_sst"],
        date_start="2024-06-01",
        date_end="2024-06-02",
        tile=TILE,
        bbox=BBOX,
    )
except ValueError as e:
    print("Expected exclusivity error:", e)
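
For readers without the package installed, the two naming helpers can be approximated as below. This is a guess at their behavior based on the names and on the "Resolved filename: l4_sst.nc" output in section 3, not a copy of the private functions: to_filename and to_token are hypothetical stand-ins, and the real helpers may handle more cases.

```python
def to_filename(source: str) -> str:
    """Append .nc unless the name already ends with it."""
    return source if source.endswith(".nc") else f"{source}.nc"


def to_token(filename: str) -> str:
    """Strip a trailing .nc to recover the extension-free source token."""
    return filename[:-3] if filename.endswith(".nc") else filename


for name in ("l4_sst", "l4_sst.nc"):
    print(f"{name!r} -> filename {to_filename(name)!r}, token {to_token(name)!r}")
```

Both directions are idempotent in this sketch, so a caller can pass either form and the result keys stay in a single, extension-free spelling.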

10. HuggingFace Streaming Pattern

For record-level streaming (without full snapshot download), use datasets.load_dataset(..., streaming=True).

try:
    from datasets import load_dataset

    stream = load_dataset("nilsleh/OceanTACO", split="train", streaming=True)
    first_row = next(iter(stream))
    print("Stream row keys:", list(first_row.keys())[:12])
except ImportError:
    print("Install extras first: pip install -e '.[hf]'")

11. Summary and Next Steps

You have now seen the full retrieval stack, from catalog filtering to merged multi-source time series.

Next steps: