OceanTACO Data Retrieval Workflows
This notebook is a self-contained guide to OceanTACO retrieval APIs before ML batching.
It covers:
Catalog loading and file discovery
Direct file fetch internals
Tile and bbox retrieval
SWOT-specific merge behavior
Multi-source time series loading
Edge-case handling and HuggingFace streaming
1. Install Profiles
# Verify ocean_taco is installed.
# Uncomment one line to install inside this notebook:
# %pip install "ocean_taco[hf] @ git+https://github.com/nilsleh/oceanTACO.git@main"
# %pip install ocean-taco[hf]
import ocean_taco
print(f"ocean_taco {ocean_taco.__version__} ready")
ocean_taco 0.1.0 ready
from pathlib import Path
from IPython.display import display
from ocean_taco.dataset.retrieve import (
    HF_DEFAULT_URL,
    _iter_dates,
    _normalize_source_filename,
    _source_token,
    _tile_bbox,
    fetch_nc,
    load_bbox_nc,
    load_bbox_swot_nc,
    load_hf_dataset,
    load_multisource_time_series_nc,
    load_region_product_nc,
    load_tile_nc,
    query_files,
)
# -- Configuration --
HF_URL = HF_DEFAULT_URL
CACHE_DIR = "./cache"
# Keep defaults small for fast tutorial execution.
DATE = "2024-06-01"
DATE_START = "2024-06-01"
DATE_END = "2024-06-03"
TILE = "NORTH_ATLANTIC"
BBOX = (-95.0, -5.0, -85.0, 5.0)
DATA_SOURCE = "l4_sst"
MULTI_SOURCES = ["l4_ssh", "l4_sst", "glorys"]
Path(CACHE_DIR).mkdir(parents=True, exist_ok=True)
print("Configuration ready")
Configuration ready
2. Retrieval API Map
Main retrieval functions in ocean_taco.dataset.retrieve:
load_hf_dataset(url): open catalog metadata from HuggingFace
query_files(dataset, date, bbox, needed_files): filter catalog records
fetch_nc(row): download one NetCDF row into an xarray.Dataset
load_tile_nc(...): one tile, one source, one date
load_region_product_nc(...): alias of load_tile_nc
load_bbox_nc(...): all overlapping tiles for bbox + merge
load_bbox_swot_nc(...): SWOT-specific bbox load with non-spatial dim cleanup
load_multisource_time_series_nc(...): source x date-range loader
catalog = load_hf_dataset(HF_URL)
print("Catalog loaded:", type(catalog))
Catalog loaded: <class 'tacoreader.dataset.TacoDataset'>
3. File Discovery Internals (query_files)
query_files chains:
filter_datetime(f"{date}/{date}")
filter_bbox(*bbox, level=1)
flatten()
filename filter via l2:id
Bounding box order is (lon_min, lat_min, lon_max, lat_max).
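Because a swapped (lat, lon) tuple is an easy mistake, it can help to check the order before querying. The following is a minimal sketch: `check_bbox` is a hypothetical helper, not part of ocean_taco, and it assumes boxes do not cross the antimeridian.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]


def check_bbox(bbox: BBox) -> BBox:
    """Validate a (lon_min, lat_min, lon_max, lat_max) tuple.

    Hypothetical helper for illustration; not part of ocean_taco.
    Assumes the box does not cross the antimeridian.
    """
    lon_min, lat_min, lon_max, lat_max = bbox
    if not (-180.0 <= lon_min < lon_max <= 180.0):
        raise ValueError(f"bad longitude range: {lon_min}, {lon_max}")
    if not (-90.0 <= lat_min < lat_max <= 90.0):
        raise ValueError(f"bad latitude range: {lat_min}, {lat_max}")
    return bbox


# The tutorial bbox is already in the expected order.
ok = check_bbox((-95.0, -5.0, -85.0, 5.0))

# The same region written as (lat, lon, lat, lon) is rejected,
# because -95 is not a valid latitude.
try:
    check_bbox((-5.0, -95.0, 5.0, -85.0))
    swapped_rejected = False
except ValueError:
    swapped_rejected = True
```

A range check like this only catches swaps where a longitude falls outside [-90, 90]; near the equator and prime meridian a swapped tuple can still look valid.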
resolved_name = _normalize_source_filename(DATA_SOURCE)
discovered = query_files(catalog, DATE, _tile_bbox(TILE), {resolved_name})
print("Resolved filename:", resolved_name)
print("Rows found:", len(discovered))
if len(discovered) > 0:
    preview_cols = [c for c in ["l2:id", "gdal_vsi", "internal:gdal_vsi", "url", "href"] if c in discovered.columns]
    display(discovered[preview_cols].head(3))
Resolved filename: l4_sst.nc
Rows found: 6
| | l2:id | gdal_vsi |
|---|---|---|
| 7 | l4_sst.nc | /vsicurl/https://huggingface.co/datasets/nilsl... |
| 18 | l4_sst.nc | /vsicurl/https://huggingface.co/datasets/nilsl... |
| 29 | l4_sst.nc | /vsicurl/https://huggingface.co/datasets/nilsl... |
4. Single Row Fetch Internals (fetch_nc)
fetch_nc scans URL-like columns in this order: internal:gdal_vsi, gdal_vsi, url, href, then reads NetCDF with h5netcdf.
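The column scan can be pictured as a first-match lookup. This is a sketch only: `first_url` is a hypothetical stand-in that works on a plain dict, the example URL is made up, and the real fetch_nc may handle missing or empty values differently.

```python
from typing import Mapping, Optional

# Column order as described above for fetch_nc.
URL_COLUMNS = ("internal:gdal_vsi", "gdal_vsi", "url", "href")


def first_url(row: Mapping[str, object]) -> Optional[str]:
    """Return the first non-empty URL-like value in priority order, or None.

    Hypothetical stand-in for illustration; not the library's implementation.
    """
    for col in URL_COLUMNS:
        value = row.get(col)
        if isinstance(value, str) and value.strip():
            return value
    return None


# Made-up row: no internal:gdal_vsi column, so gdal_vsi wins over the empty url.
row = {
    "l2:id": "l4_sst.nc",
    "gdal_vsi": "/vsicurl/https://example.org/l4_sst.nc",
    "url": "",
}
chosen = first_url(row)
```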
if len(discovered) == 0:
    print("No rows available to demonstrate fetch_nc.")
else:
    fname, ds_one = fetch_nc(discovered.iloc[0])
    print("Fetched file:", fname)
    # Dataset.sizes is the mapping from dimension name to length.
    print("Dataset dims:", dict(ds_one.sizes))
    print("Data variables:", list(ds_one.data_vars)[:8])
Fetched file: l4_sst.nc
Dataset dims: {'lat': 800, 'lon': 900}
Data variables: ['analysed_sst', 'analysis_error', 'mask', 'sea_ice_fraction']
5. One Exact Tile (load_tile_nc)
Use this path for deterministic region-level analysis.
tile_ds = load_tile_nc(
    dataset=catalog,
    date=DATE,
    tile=TILE,
    data_source=DATA_SOURCE,
    cache_dir=CACHE_DIR,
)
if tile_ds is None:
    print("Tile dataset not found for requested inputs.")
else:
    # Dataset.sizes is the mapping from dimension name to length.
    print("Tile dims:", dict(tile_ds.sizes))
    print("Tile vars (sample):", list(tile_ds.data_vars)[:8])
Tile dims: {'lat': 800, 'lon': 900}
Tile vars (sample): ['analysed_sst', 'analysis_error', 'mask', 'sea_ice_fraction']
region_ds = load_region_product_nc(
    dataset=catalog,
    date=DATE,
    region=TILE,
    data_source=DATA_SOURCE,
    cache_dir=CACHE_DIR,
)
print("Alias loader returned None?", region_ds is None)
if tile_ds is not None and region_ds is not None:
    print("Same dims as load_tile_nc?", dict(tile_ds.sizes) == dict(region_ds.sizes))
Alias loader returned None? False
Same dims as load_tile_nc? True
6. Bbox Retrieval and Merge (load_bbox_nc)
load_bbox_nc:
finds all tiles overlapping the bbox
downloads each tile (or reuses cache)
returns one dataset (single tile) or merges many via xr.combine_by_coords
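The first step, finding overlapping tiles, amounts to a rectangle-intersection test. The sketch below uses a made-up tile table (`TILES` and its names are hypothetical and do not reflect OceanTACO's real tiling) to show how one bbox can select several tiles.

```python
from typing import Dict, List, Tuple

BBox = Tuple[float, float, float, float]  # (lon_min, lat_min, lon_max, lat_max)

# Hypothetical tile extents, for illustration only.
TILES: Dict[str, BBox] = {
    "TILE_A": (-100.0, -10.0, -90.0, 0.0),
    "TILE_B": (-90.0, -10.0, -80.0, 0.0),
    "TILE_C": (-100.0, 0.0, -90.0, 10.0),
    "TILE_D": (-90.0, 0.0, -80.0, 10.0),
    "TILE_E": (-80.0, 0.0, -70.0, 10.0),
}


def overlaps(a: BBox, b: BBox) -> bool:
    """True when two boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def tiles_for_bbox(bbox: BBox) -> List[str]:
    """Names of all tiles in the toy table that intersect the bbox."""
    return sorted(name for name, extent in TILES.items() if overlaps(bbox, extent))


# The tutorial bbox straddles the corner shared by four toy tiles.
hits = tiles_for_bbox((-95.0, -5.0, -85.0, 5.0))
print(hits)
```

In this toy layout the bbox touches four tiles, which is the case where a merge step is needed; a bbox that fits inside one tile would return a single name.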
bbox_ds = load_bbox_nc(
    dataset=catalog,
    date=DATE,
    bbox=BBOX,
    data_source=DATA_SOURCE,
    cache_dir=CACHE_DIR,
)
print("BBOX:", BBOX)
if bbox_ds is None:
    print("No bbox dataset found.")
else:
    # Dataset.sizes is the mapping from dimension name to length.
    print("Bbox dims:", dict(bbox_ds.sizes))
    print("Bbox vars (sample):", list(bbox_ds.data_vars)[:8])
BBOX: (-95.0, -5.0, -85.0, 5.0)
Bbox dims: {'lat': 1600, 'lon': 1800}
Bbox vars (sample): ['analysed_sst', 'analysis_error', 'mask', 'sea_ice_fraction']
7. SWOT-Specific Bbox Retrieval (load_bbox_swot_nc)
SWOT tiles can contain non-spatial dimensions (for example track) that do not align across tiles.
Before merge, the loader drops variables bound to non-spatial dims to preserve mergeable (time, lat, lon) gridded fields.
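The cleanup rule can be illustrated without any SWOT data. In this sketch the variable names and dimension tuples are invented, and `keep_gridded` is a hypothetical stand-in that operates on a plain dict rather than an xarray.Dataset; it only shows the selection logic described above.

```python
from typing import Dict, Tuple

# Dimensions that gridded, mergeable fields are allowed to use.
GRID_DIMS = {"time", "lat", "lon"}


def keep_gridded(var_dims: Dict[str, Tuple[str, ...]]) -> Dict[str, Tuple[str, ...]]:
    """Keep only variables whose dimensions are all within GRID_DIMS."""
    return {name: dims for name, dims in var_dims.items() if set(dims) <= GRID_DIMS}


# Hypothetical variable names and dims, for illustration only.
swot_like = {
    "ssha": ("time", "lat", "lon"),
    "mask": ("lat", "lon"),
    "track_id": ("track",),
    "track_time": ("track", "time"),
}
kept = keep_gridded(swot_like)
print(sorted(kept))
```

Any variable that touches a non-grid dimension such as track is dropped, even if it also uses time, so the remaining variables share coordinates that can align across tiles.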
swot_ds = load_bbox_swot_nc(
    dataset=catalog,
    date=DATE,
    bbox=BBOX,
    cache_dir=CACHE_DIR,
)
if swot_ds is None:
    print("No SWOT data found for this date/bbox.")
else:
    # Dataset.sizes is the mapping from dimension name to length.
    print("SWOT dims:", dict(swot_ds.sizes))
    print("SWOT vars (sample):", list(swot_ds.data_vars)[:10])
8. Multi-Source Time Series (load_multisource_time_series_nc)
This API loops over dates and sources, then concatenates each source over time.
Contract highlights:
exactly one of tile or bbox
date range is inclusive
each source key is normalized to an extension-free token
values can be None if no data found
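Two of these contract points, the inclusive date range and the tile/bbox exclusivity, are easy to express in plain Python. The helpers below are hypothetical re-creations for illustration, not the library's `_iter_dates` or its argument checks, and the real error messages may differ.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple


def iter_dates_inclusive(start: str, end: str) -> List[str]:
    """List ISO dates from start to end, including both endpoints."""
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    if last < first:
        raise ValueError("date_end is before date_start")
    n_days = (last - first).days
    return [(first + timedelta(days=i)).isoformat() for i in range(n_days + 1)]


def check_region(
    tile: Optional[str],
    bbox: Optional[Tuple[float, float, float, float]],
) -> None:
    """Raise unless exactly one of tile or bbox is given."""
    if (tile is None) == (bbox is None):
        raise ValueError("pass exactly one of tile or bbox")


days = iter_dates_inclusive("2024-06-01", "2024-06-03")
check_region(tile="NORTH_ATLANTIC", bbox=None)  # accepted: only tile is set

# Passing both selectors is rejected.
try:
    check_region(tile="NORTH_ATLANTIC", bbox=(-95.0, -5.0, -85.0, 5.0))
    both_rejected = False
except ValueError:
    both_rejected = True
```

With the tutorial defaults, an inclusive range from 2024-06-01 to 2024-06-03 covers three days, so a source with data on every date would be expected to have a time length of 3.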
stacked = load_multisource_time_series_nc(
    dataset=catalog,
    data_sources=MULTI_SOURCES,
    date_start=DATE_START,
    date_end=DATE_END,
    tile=TILE,
    cache_dir=CACHE_DIR,
)
for source_key, ds in stacked.items():
    if ds is None:
        print(f"{source_key}: None (no data in range)")
        continue
    time_len = int(ds.sizes.get("time", 0))
    print(f"{source_key}: dims={dict(ds.sizes)} time_len={time_len}")
9. Edge Cases and Helper Semantics
These helpers are useful to understand why API outputs look the way they do.
print("Normalize token -> filename:", _normalize_source_filename("l4_sst"))
print("Normalize filename -> filename:", _normalize_source_filename("l4_sst.nc"))
print("Source token from filename:", _source_token("l4_sst.nc"))
print("Inclusive date iteration:", _iter_dates("2024-06-01", "2024-06-03"))
try:
    _ = load_multisource_time_series_nc(
        dataset=catalog,
        data_sources=["l4_sst"],
        date_start="2024-06-01",
        date_end="2024-06-02",
        tile=TILE,
        bbox=BBOX,
    )
except ValueError as e:
    print("Expected exclusivity error:", e)
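For readers without the package installed, the name-normalization behaviour exercised above can be approximated as follows. These two functions are hypothetical re-creations based on the behaviour this notebook describes (a token gains a .nc extension, a filename is left alone, and the token is the extension-free name); the private helpers in ocean_taco may handle more cases.

```python
def normalize_source_filename(source: str) -> str:
    """Return a .nc filename for either a bare token or a filename."""
    return source if source.endswith(".nc") else f"{source}.nc"


def source_token(filename: str) -> str:
    """Return the extension-free token for a .nc filename."""
    return filename[: -len(".nc")] if filename.endswith(".nc") else filename


from_token = normalize_source_filename("l4_sst")
from_filename = normalize_source_filename("l4_sst.nc")
token = source_token("l4_sst.nc")
print(from_token, from_filename, token)
```

Making normalization idempotent means callers can pass either form and still get matching dictionary keys and catalog filenames.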
10. HuggingFace Streaming Pattern
For record-level streaming (without full snapshot download), use datasets.load_dataset(..., streaming=True).
try:
    # Import lazily so the notebook still runs without the optional extra.
    from datasets import load_dataset

    stream = load_dataset("nilsleh/OceanTACO", split="train", streaming=True)
    first_row = next(iter(stream))
    print("Stream row keys:", list(first_row.keys())[:12])
except ImportError:
    print("Install extras first: pip install -e '.[hf]'")
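To preview more than one record without consuming the whole stream, `itertools.islice` works on any lazy iterable. The sketch below uses a stand-in generator (`fake_stream` and its `index` field are hypothetical) so it runs without network access; the same `islice` call should apply to a streaming dataset because it only needs iteration.

```python
from itertools import islice
from typing import Dict, Iterator


def fake_stream() -> Iterator[Dict[str, object]]:
    """Stand-in for a streaming dataset: yields records lazily, one at a time."""
    i = 0
    while True:
        yield {"index": i}
        i += 1


# Take only the first three records; the generator is never exhausted.
preview = list(islice(fake_stream(), 3))
print(preview)
```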
11. Summary and Next Steps
You have now seen the full retrieval mechanism stack from catalog filtering to merged multi-source time series.
Next steps:
Move to ML batching with OceanTACODataset in OceanTACO Dataset & Query API
Use query generation in Spatio-Temporal Query Generation Intuition
Keep this notebook as your retrieval debugging reference