OceanTACO Dataset Workflows
This guide focuses on data retrieval and access patterns before ML batching.
Use this page when you want xarray.Dataset outputs for inspection or plotting. For ML query sampling with OceanTACODataset, see OceanTACO ML Loader Workflows.
1) Install Profiles
# Main usage (dataset + queries + visualization)
pip install -e .
# Add data generation pipeline deps
pip install -e ".[generate]"
# Add Hugging Face client libs for direct download/stream examples
pip install -e ".[hf]"
2) Main Retrieval APIs
From ocean_taco/dataset/retrieve.py:
- load_hf_dataset: open the OceanTACO catalog from Hugging Face.
- load_bbox_nc: load and merge all tiles overlapping a bbox for one source.
- load_tile_nc: load exactly one named region tile.
- load_region_product_nc: alias for loading one region product file.
- load_bbox_swot_nc: helper for SWOT bbox loading.
- load_multisource_time_series_nc: load multiple sources over a date range and stack each source over time.
From datasets / huggingface_hub:
- load_dataset(..., streaming=True): iterate rows without a full download.
- snapshot_download(...): download a full dataset snapshot locally.
3) End-to-End Retrieval Example
from ocean_taco.dataset.retrieve import HF_DEFAULT_URL, load_hf_dataset, load_bbox_nc
# 1) Open remote catalog
catalog = load_hf_dataset(HF_DEFAULT_URL)
# 2) Load one source over one bbox/date
sst = load_bbox_nc(
dataset=catalog,
date="2024-06-01",
bbox=(-80, -30, 25, 50),
data_source="l4_sst",
cache_dir="./cache", # optional
)
print(sst)
4) Retrieval Pattern: One Exact Tile
Use this when you already know the region name (for reproducible region-level analysis).
from ocean_taco.dataset.retrieve import load_hf_dataset, load_tile_nc
catalog = load_hf_dataset()
ds = load_tile_nc(
dataset=catalog,
date="2024-06-01",
tile="NORTH_ATLANTIC",
data_source="l4_ssh",
cache_dir="./cache",
)
5) Retrieval Pattern: Stream Records
Use this to iterate over catalog records one at a time with Hugging Face Datasets, without downloading the full dataset first.
from datasets import load_dataset
stream = load_dataset("nilsleh/OceanTACO", split="train", streaming=True)
row = next(iter(stream))
print(row)
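To look at more than one record without consuming the whole stream, the first few rows can be taken lazily. The following is a minimal sketch: `peek_rows` is a hypothetical helper (not part of OceanTACO or `datasets`), and the generator stands in for the streaming dataset, with made-up field names, so the snippet runs offline.

```python
from itertools import islice
from typing import Any, Iterable, Iterator


def peek_rows(stream: Iterable[Any], n: int = 3) -> list[Any]:
    """Return up to the first n items of an iterable without reading further."""
    return list(islice(stream, n))


def fake_stream() -> Iterator[dict]:
    # Stand-in for load_dataset(..., streaming=True); "row_id" is a made-up field.
    for i in range(1000):
        yield {"row_id": i}


rows = peek_rows(fake_stream(), n=3)
print(len(rows))
```

Because `islice` only pulls as many items as requested, the same helper should work on any iterable, including the `stream` object from the example above.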
6) Retrieval Pattern: Local Full Snapshot
Use this when offline access or repeated full scans are needed.
from huggingface_hub import snapshot_download
local_dir = snapshot_download(repo_id="nilsleh/OceanTACO", repo_type="dataset")
print(local_dir)
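After a snapshot download it can help to check what actually landed on disk. The sketch below counts files by suffix under a directory; `summarize_snapshot` is a hypothetical helper, and nothing here assumes a particular file layout inside the snapshot.

```python
from collections import Counter
from pathlib import Path


def summarize_snapshot(local_dir: str) -> Counter:
    """Count files under local_dir (recursively) by file suffix, e.g. '.nc'."""
    counts: Counter = Counter()
    for path in Path(local_dir).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under a placeholder key.
            counts[path.suffix or "<no suffix>"] += 1
    return counts


# Usage, with local_dir as returned by snapshot_download(...):
# print(summarize_snapshot(local_dir).most_common())
```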
7) Retrieval Pattern: Multiple Sources, Same Region, Date Range
Use this when you want a small time series bundle over one fixed region, with multiple products loaded for each day.
from ocean_taco.dataset.retrieve import (
HF_DEFAULT_URL,
load_hf_dataset,
load_multisource_time_series_nc,
)
catalog = load_hf_dataset(HF_DEFAULT_URL)
sources = ["l4_ssh", "l4_sst", "l4_sss", "glorys"]
# Option A: fixed named region tile
stacked_by_source = load_multisource_time_series_nc(
dataset=catalog,
data_sources=sources,
date_start="2024-06-01",
date_end="2024-06-07",
tile="NORTH_ATLANTIC",
cache_dir="./cache", # optional
)
# Access by normalized source filename; each dataset is stacked along time (or None if no days were found)
print(stacked_by_source["l4_sst.nc"])
You can use the same utility with a custom bbox:
stacked_bbox = load_multisource_time_series_nc(
dataset=catalog,
data_sources=["l4_ssh", "l4_sst"],
date_start="2024-06-01",
date_end="2024-06-07",
bbox=(-80, 25, -30, 50),
cache_dir="./cache",
)
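To reason about which days a request covers, the date range can be expanded by hand. This is a sketch only: `daily_dates` is a hypothetical helper, and it assumes daily steps with both `date_start` and `date_end` included, which the loader itself may or may not do.

```python
from datetime import date, timedelta


def daily_dates(date_start: str, date_end: str) -> list[str]:
    """List ISO dates from date_start to date_end, both ends included (assumption)."""
    start = date.fromisoformat(date_start)
    end = date.fromisoformat(date_end)
    if end < start:
        raise ValueError("date_end must not be earlier than date_start")
    n_days = (end - start).days + 1
    return [(start + timedelta(days=i)).isoformat() for i in range(n_days)]


days = daily_dates("2024-06-01", "2024-06-07")
print(len(days), days[0], days[-1])  # 7 2024-06-01 2024-06-07
```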
Notes:
- return type is dict[source_filename, xr.Dataset | None]
- keys are normalized filenames, e.g. l4_sst.nc
- values can be None when no days were found for that source in the range
- pass exactly one of tile or bbox
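The notes above suggest two small guards on the caller's side: checking the region arguments before the call, and separating sources that came back as None afterwards. The sketch below shows one possible shape for both. `check_region_args` and `split_missing` are hypothetical helpers, the loader may already perform similar checks internally, and a plain string stands in for a real xr.Dataset so the snippet runs without data.

```python
from typing import Any, Optional


def check_region_args(tile: Optional[str] = None, bbox: Optional[tuple] = None) -> None:
    """Raise ValueError if neither or both of tile and bbox are given."""
    if (tile is None) == (bbox is None):
        raise ValueError("pass exactly one of tile or bbox")


def split_missing(result: dict[str, Any]) -> tuple[dict[str, Any], list[str]]:
    """Separate loaded datasets from sources whose value is None."""
    loaded = {name: ds for name, ds in result.items() if ds is not None}
    missing = [name for name, ds in result.items() if ds is None]
    return loaded, missing


# Stand-in for the dict returned by load_multisource_time_series_nc.
example = {"l4_sst.nc": "<xr.Dataset>", "l4_sss.nc": None}
loaded, missing = split_missing(example)
print(sorted(loaded), missing)
```

Filtering once after the call keeps downstream code free of repeated None checks.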
8) Choosing Retrieval Mode
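| Need | Recommended method |
|---|---|
| One date + bbox + source | load_bbox_nc |
| One known region tile | load_tile_nc |
| Multi-source time series over one region | load_multisource_time_series_nc(..., tile=...) |
| Multi-source time series over a custom area | load_multisource_time_series_nc(..., bbox=...) |
| Record-by-record streaming | load_dataset(..., streaming=True) |
| Full local copy | snapshot_download(...) |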
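The choice between retrieval modes can also be written as a small decision function, using the function names from sections 2 to 7. This is purely illustrative: `recommend_method` is a hypothetical helper that returns a name as a string and does not call any OceanTACO code.

```python
from typing import Optional


def recommend_method(
    *,
    time_series: bool = False,
    tile: Optional[str] = None,
    bbox: Optional[tuple] = None,
    streaming: bool = False,
    full_copy: bool = False,
) -> str:
    """Map a retrieval need to the name of the function this guide uses for it."""
    if full_copy:
        return "snapshot_download"
    if streaming:
        return "load_dataset(streaming=True)"
    if time_series:
        # Works with either a named tile or a custom bbox.
        return "load_multisource_time_series_nc"
    if tile is not None:
        return "load_tile_nc"
    if bbox is not None:
        return "load_bbox_nc"
    raise ValueError("specify a tile, a bbox, streaming, or full_copy")


print(recommend_method(tile="NORTH_ATLANTIC"))
```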
9) Next Step: ML Loader Workflows
Once you have validated retrieval, move on to query-based sampling and DataLoader integration in OceanTACO ML Loader Workflows.