Getting Started

OceanTACO supports two complementary access patterns:

direct retrieval with xarray data loading for inspection and analysis
query-based sampling with OceanTACODataset for ML training/evaluation

OceanTACODataset works with both local data and remote HuggingFace URLs, so the key decision is workflow style (analysis vs. ML), not local vs. remote storage. Currently, because of storage limits, streaming works only for the Core dataset.

Prerequisites

Python 3.11+
Conda or pip
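
Before installing, it can help to confirm that the active interpreter meets the Python 3.11+ requirement. The snippet below is a minimal sketch using only the standard library; `python_is_supported` is a hypothetical helper name, not part of OceanTACO.

```python
import sys

MIN_PYTHON = (3, 11)  # minimum version stated in the prerequisites above


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if python_is_supported():
    print("Python version OK")
else:
    print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
```

Running this inside the Conda environment or virtualenv you plan to install into avoids checking the wrong interpreter.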

Installation

Most users should install directly from PyPI:

# Core package
pip install ocean-taco

# With HuggingFace helpers
pip install "ocean-taco[hf]"

If you want the latest development version from GitHub:

pip install "ocean_taco[hf] @ git+https://github.com/nilsleh/oceanTACO.git@main"

If you have cloned this repository and want a local editable install, run the following from the repository root:

# Dataset loading + queries + visualization (default)
pip install -e .

# Add HuggingFace client helpers for streaming/downloading
pip install -e ".[hf]"

# Add data-generation pipeline dependencies
pip install -e ".[generate]"

# Full development profile
pip install -e ".[generate,hf,tests]"
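
After any of the install commands above, one way to check what ended up in the environment is to query the installed distribution metadata. This is a sketch using the standard library only. `installed_version` is a hypothetical helper, and `huggingface_hub` is an assumption about what the `hf` extra provides; adjust the names to match the project's actual dependency list.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# "ocean-taco" is the PyPI name used in the install commands above.
# "huggingface_hub" is an assumed dependency of the [hf] extra.
for name in ("ocean-taco", "huggingface_hub"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```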

Choose the Right Workflow

Use this quick rule:

use direct retrieval when you want to inspect one date/region/source as an xarray.Dataset
use OceanTACODataset when you need repeated sampling over many queries and batches

Goal	Recommended API	Typical output
Inspect and visualize a few subsets	`ocean_taco.dataset.retrieve` helpers	`xr.Dataset`
Build training/eval samples with query generation	`OceanTACODataset` + `QueryGenerator`	PyTorch-ready sample dicts
Use local files only	either API	same as above
Stream from HuggingFace	either API	same as above

Quick Start: Retrieval (Direct `xarray` Access)

Use this for cloud-native subsetting and plotting workflows.

from ocean_taco.dataset.retrieve import HF_DEFAULT_URL, load_hf_dataset, load_bbox_nc

dataset_hf = load_hf_dataset(HF_DEFAULT_URL)

# Retrieve tiles overlapping a bbox for one date and one data source.
ds = load_bbox_nc(
    dataset_hf,
    date="2024-06-01",
    bbox=(-80, -30, 25, 50),   # (lon_min, lon_max, lat_min, lat_max)
    data_source="l4_sst",
    cache_dir="./cache",       # optional local cache
)

For more retrieval patterns (single tile, full snapshot, stream records), see OceanTACO Dataset Workflows.
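
The `bbox` argument in the example above is ordered `(lon_min, lon_max, lat_min, lat_max)`, which differs from the `(west, south, east, north)` order used by many GIS tools. The sketch below shows one way to guard against mixing the two. Both helpers are hypothetical and not part of OceanTACO, and the sketch assumes longitudes in [-180, 180] and boxes that do not cross the antimeridian.

```python
def validate_bbox(bbox):
    """Check a (lon_min, lon_max, lat_min, lat_max) tuple and return it as floats."""
    if len(bbox) != 4:
        raise ValueError("bbox must contain exactly four values")
    lon_min, lon_max, lat_min, lat_max = (float(v) for v in bbox)
    # Assumption: longitudes are in [-180, 180] and lon_min < lon_max.
    if not (-180.0 <= lon_min < lon_max <= 180.0):
        raise ValueError(f"invalid longitude range: {lon_min}, {lon_max}")
    if not (-90.0 <= lat_min < lat_max <= 90.0):
        raise ValueError(f"invalid latitude range: {lat_min}, {lat_max}")
    return lon_min, lon_max, lat_min, lat_max


def bbox_from_corners(west, south, east, north):
    """Convert (west, south, east, north) to (lon_min, lon_max, lat_min, lat_max)."""
    return validate_bbox((west, east, south, north))


print(validate_bbox((-80, -30, 25, 50)))
print(bbox_from_corners(west=-80, south=25, east=-30, north=50))
```

A range check like this catches reversed bounds, but it cannot detect every ordering mistake: a tuple in the wrong order can still pass when its values happen to form valid ranges, so keyword arguments as in `bbox_from_corners` are the safer entry point.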

Quick Start: ML Loader (`OceanTACODataset`)

Use this when you need consistent query sampling for model training/evaluation.

taco_path can be either:

a local OceanTACO dataset path
a remote dataset URL such as HF_DEFAULT_URL
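
Because both forms are accepted, a script can prefer a local copy when one exists and fall back to the remote URL otherwise. The following is a minimal sketch of that pattern; `resolve_taco_path` is a hypothetical helper, and the URL shown is a placeholder rather than the real value of `HF_DEFAULT_URL`.

```python
from pathlib import Path


def resolve_taco_path(local_dir, remote_url):
    """Return the local path if it exists on disk, otherwise the remote URL."""
    if local_dir is not None:
        candidate = Path(local_dir).expanduser()
        if candidate.exists():
            return str(candidate)
    return remote_url


# Placeholder URL for illustration; in practice pass HF_DEFAULT_URL instead.
taco_path = resolve_taco_path("/path/to/OceanTACO", "https://example.invalid/oceantaco")
print(taco_path)
```

The result can then be passed as the `taco_path` argument of `OceanTACODataset`.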

from torch.utils.data import DataLoader
from ocean_taco.dataset import OceanTACODataset, collate_ocean_samples
from ocean_taco.dataset.queries import QueryGenerator, PatchSize
from ocean_taco.dataset.retrieve import HF_DEFAULT_URL

dataset = OceanTACODataset(
    taco_path=HF_DEFAULT_URL,  # or "/path/to/OceanTACO"
    input_variables=["l4_ssh", "l4_sst", "glorys_sss"],
    target_variables=["l3_swot"],
    temporal_agg="mean",
)

generator = QueryGenerator(land_mask_path=".ocean_mask_cache/land_mask.npy")
queries = generator.generate_training_queries(
    n_queries=1000,
    patch_size=PatchSize(2.0, "deg"),
    date_range=("2024-01-01", "2024-03-31"),
)

loader = DataLoader(
    dataset,
    sampler=queries,
    batch_size=16,
    collate_fn=collate_ocean_samples,
    num_workers=4,
)
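
To make the idea of a query concrete, the sketch below draws random date and bounding-box pairs for a fixed patch size. It illustrates the concept only and is not how `QueryGenerator` is implemented: the function name, the dict layout, the inclusive date range, and the rectangular `region` argument are all assumptions, and no land masking is applied.

```python
import random
from datetime import date, timedelta


def sample_queries(n_queries, patch_deg, date_range, region, seed=0):
    """Draw random queries; each bbox is (lon_min, lon_max, lat_min, lat_max)."""
    rng = random.Random(seed)  # seeded for reproducible sampling
    start, end = (date.fromisoformat(d) for d in date_range)
    n_days = (end - start).days + 1  # assumption: end date is inclusive
    lon_lo, lon_hi, lat_lo, lat_hi = region
    queries = []
    for _ in range(n_queries):
        day = start + timedelta(days=rng.randrange(n_days))
        # Pick the lower-left corner so the whole patch stays inside the region.
        lon_min = rng.uniform(lon_lo, lon_hi - patch_deg)
        lat_min = rng.uniform(lat_lo, lat_hi - patch_deg)
        bbox = (lon_min, lon_min + patch_deg, lat_min, lat_min + patch_deg)
        queries.append({"date": day.isoformat(), "bbox": bbox})
    return queries


example_queries = sample_queries(
    n_queries=5,
    patch_deg=2.0,
    date_range=("2024-01-01", "2024-03-31"),
    region=(-80.0, -30.0, 25.0, 50.0),
)
for query in example_queries:
    print(query)
```

Fixing the seed makes the sampled set repeatable, which is useful when the same queries must be reused across evaluation runs.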

For detailed ML guidance (query design, batching, patch sizing, training vs eval), see OceanTACO ML Loader Workflows.

Next Steps

See the Dataset Description for a full list of variables and ocean regions.
See OceanTACO Dataset Workflows for retrieval and streaming workflows.
See OceanTACO ML Loader Workflows for end-to-end ML loader workflows.
See the Dataset Generation Pipeline page for the full raw-data -> formatted -> TACO build pipeline.
Walk through the Tutorials for end-to-end examples.
Consult the API Reference for the full public API.
