Getting Started
OceanTACO supports two complementary access patterns:
direct retrieval with xarray data loading for inspection and analysis
query-based sampling with OceanTACODataset for ML training/evaluation
OceanTACODataset works with both local data and remote HuggingFace URLs, so the key decision is workflow style (analysis vs ML), not local vs remote. Note that, due to storage limits, streaming currently works only for the Core dataset.
Prerequisites
Python 3.11+
Conda or pip
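To confirm that the active interpreter meets the Python 3.11+ requirement before installing, a check along these lines may help. The helper name `python_is_supported` is illustrative and not part of OceanTACO.

```python
import sys

# Minimum version taken from the prerequisites above.
MIN_PYTHON = (3, 11)


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


status = "OK" if python_is_supported() else "too old for OceanTACO"
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {status}")
```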
Installation
Most users should install directly from PyPI:
# Core package
pip install ocean-taco
# With HuggingFace helpers
pip install "ocean-taco[hf]"
If you want the latest development version from GitHub:
pip install "ocean_taco[hf] @ git+https://github.com/nilsleh/oceanTACO.git@main"
If you have cloned this repository and want a local editable install, run the following from the repository root:
# Dataset loading + queries + visualization (default)
pip install -e .
# Add HuggingFace client helpers for streaming/downloading
pip install -e ".[hf]"
# Add data-generation pipeline dependencies
pip install -e ".[generate]"
# Full development profile
pip install -e ".[generate,hf,tests]"
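After installing, one way to confirm which version ended up in the environment is to query the package metadata with the standard library. This is a minimal sketch: `installed_version` is a hypothetical helper, and it assumes the distribution is registered under the PyPI name `ocean-taco` used in the commands above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str = "ocean-taco") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"ocean-taco: {version if version is not None else 'not installed'}")
```

A `None` result usually means the install targeted a different environment than the one running the check.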
Choose the Right Workflow
Use this quick rule:
use direct retrieval when you want to inspect one date/region/source as an xarray.Dataset
use OceanTACODataset when you need repeated sampling over many queries and batches
| Goal | Recommended API | Typical output |
|---|---|---|
| Inspect and visualize a few subsets | direct retrieval | xarray.Dataset |
| Build training/eval samples with query generation | OceanTACODataset | PyTorch-ready sample dicts |
| Use local files only | either API | same as above |
| Stream from HuggingFace | either API | same as above |
Quick Start: Retrieval (Direct xarray Access)
Use this for cloud-native subsetting and plotting workflows.
from ocean_taco.dataset.retrieve import HF_DEFAULT_URL, load_hf_dataset, load_bbox_nc
dataset_hf = load_hf_dataset(HF_DEFAULT_URL)
# Retrieve tiles overlapping a bbox for one date and one data source.
ds = load_bbox_nc(
    dataset_hf,
    date="2024-06-01",
    bbox=(-80, -30, 25, 50),  # (lon_min, lon_max, lat_min, lat_max)
    data_source="l4_sst",
    cache_dir="./cache",  # optional local cache
)
For more retrieval patterns (single tile, full snapshot, stream records), see OceanTACO Dataset Workflows.
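The bbox in the retrieval example is ordered (lon_min, lon_max, lat_min, lat_max), which differs from the (west, south, east, north) order many GIS tools use. The sketch below reorders and sanity-checks such a box before it is passed as `bbox=`. `BBox` and `bbox_from_wsen` are hypothetical helpers, not part of OceanTACO, and the sketch assumes longitudes in [-180, 180] and boxes that do not cross the antimeridian.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Bounding box in the (lon_min, lon_max, lat_min, lat_max) order."""

    lon_min: float
    lon_max: float
    lat_min: float
    lat_max: float


def bbox_from_wsen(west: float, south: float, east: float, north: float) -> BBox:
    """Reorder (west, south, east, north) into (lon_min, lon_max, lat_min, lat_max)."""
    # Assumption: longitudes lie in [-180, 180] and the box does not wrap around.
    if not (-180 <= west < east <= 180):
        raise ValueError(f"expected -180 <= west < east <= 180, got {west}, {east}")
    if not (-90 <= south < north <= 90):
        raise ValueError(f"expected -90 <= south < north <= 90, got {south}, {north}")
    return BBox(west, east, south, north)


# Same region as the retrieval example above.
bbox = bbox_from_wsen(west=-80, south=25, east=-30, north=50)
print(tuple(bbox))
```

Because `BBox` is a named tuple, it can be passed anywhere a plain 4-tuple is expected.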
Quick Start: ML Loader (OceanTACODataset)
Use this when you need consistent query sampling for model training/evaluation.
taco_path can be either:
a local OceanTACO dataset path
a remote dataset URL such as HF_DEFAULT_URL
from torch.utils.data import DataLoader
from ocean_taco.dataset import OceanTACODataset, collate_ocean_samples
from ocean_taco.dataset.queries import QueryGenerator, PatchSize
from ocean_taco.dataset.retrieve import HF_DEFAULT_URL
dataset = OceanTACODataset(
    taco_path=HF_DEFAULT_URL,  # or "/path/to/OceanTACO"
    input_variables=["l4_ssh", "l4_sst", "glorys_sss"],
    target_variables=["l3_swot"],
    temporal_agg="mean",
)
generator = QueryGenerator(land_mask_path=".ocean_mask_cache/land_mask.npy")
queries = generator.generate_training_queries(
    n_queries=1000,
    patch_size=PatchSize(2.0, "deg"),
    date_range=("2024-01-01", "2024-03-31"),
)
loader = DataLoader(
    dataset,
    sampler=queries,
    batch_size=16,
    collate_fn=collate_ocean_samples,
    num_workers=4,
)
For detailed ML guidance (query design, batching, patch sizing, training vs eval), see OceanTACO ML Loader Workflows.
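When training and evaluation queries should not share dates, one option is to split the overall period into two consecutive date ranges and generate queries for each separately. The sketch below uses only the standard library. `split_date_range` is a hypothetical helper, not part of OceanTACO, and it assumes both ends of a date range are treated as inclusive.

```python
from datetime import date, timedelta


def split_date_range(start: str, end: str, train_fraction: float = 0.8):
    """Split an inclusive ISO date range into consecutive train and eval ranges."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    first, last = date.fromisoformat(start), date.fromisoformat(end)
    n_days = (last - first).days + 1  # inclusive day count
    if n_days < 2:
        raise ValueError("need at least two days to split")
    # Keep at least one day on each side of the split.
    n_train = min(max(int(n_days * train_fraction), 1), n_days - 1)
    train_end = first + timedelta(days=n_train - 1)
    eval_start = train_end + timedelta(days=1)
    train = (first.isoformat(), train_end.isoformat())
    evaluation = (eval_start.isoformat(), last.isoformat())
    return train, evaluation


train_range, eval_range = split_date_range("2024-01-01", "2024-03-31")
print("train:", train_range)
print("eval: ", eval_range)
```

With the date range from the quick start, this yields a training range of 2024-01-01 to 2024-03-12 and an evaluation range of 2024-03-13 to 2024-03-31. Each tuple has the same shape as the `date_range` argument shown above.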
Next Steps
See the Dataset Description for a full list of variables and ocean regions.
See OceanTACO Dataset Workflows for retrieval and streaming workflows.
See OceanTACO ML Loader Workflows for end-to-end ML loader workflows.
See the Dataset Generation Pipeline page for the full raw-data -> formatted -> TACO build pipeline.
Walk through the Tutorials for end-to-end examples.
Consult the API Reference for the full public API.