Dataset Generation Pipeline
This page explains how to generate OceanTACO from raw source data in three steps:
1. Download raw source data (download.py)
2. Format source data into regional NetCDF tiles + inventory (format.py)
3. Build a TACO dataset from the formatted inventory (build_taco.py)
The goal is to give you a self-contained workflow you can run and adapt.
Prerequisites
Install generation dependencies from the repository root:
pip install -e ".[generate]"
If you use conda in this repo:
conda activate testpy311
pip install -e ".[generate]"
Directory Flow
The three steps use this flow:
Step 1 output (--output-dir in download.py): raw source files (for example ./ssh_state_data)
Step 2 input (--data-dir in format.py): same raw source directory
Step 2 output (--output-dir in format.py): formatted regional files + inventory parquet (for example ./formatted_ssh_data)
Step 3 input: --data-dir points to the formatted directory, --inventory-path points to the inventory parquet
Step 3 output (--output-dir in build_taco.py): final OceanTACO folder/parts
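The hand-off between the steps can be written down once and reused. The sketch below is only an illustration: the `PipelinePaths` class and its field names are hypothetical and not part of the repository; the directory values are the examples used on this page.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class PipelinePaths:
    """Directories shared between the three steps (hypothetical helper)."""

    raw_dir: Path        # download.py --output-dir, format.py --data-dir
    formatted_dir: Path  # format.py --output-dir, build_taco.py --data-dir
    taco_dir: Path       # build_taco.py --output-dir
    inventory_name: str = "file_inventory.parquet"

    @property
    def inventory_path(self) -> Path:
        # format.py writes the inventory under its own output directory.
        return self.formatted_dir / self.inventory_name


paths = PipelinePaths(
    raw_dir=Path("./ssh_state_data"),
    formatted_dir=Path("./formatted_ssh_data"),
    taco_dir=Path("./tortilla"),
)
print(paths.inventory_path)
```

Keeping the inventory path derived from the formatted directory avoids passing build_taco.py a path that format.py never wrote.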
Quick End-to-End Commands
# 1) Download raw data (the script only previews unless --download is passed)
python ocean_taco/generate_dataset/download.py \
--start-date 2024-01-01 \
--end-date 2024-01-04 \
--output-dir ./ssh_state_data \
--download
# 2) Format into regional NetCDF + inventory parquet
python ocean_taco/generate_dataset/format.py \
--date-min 2024-01-01 \
--date-max 2024-01-04 \
--data-dir ./ssh_state_data \
--output-dir ./formatted_ssh_data \
--inventory-path file_inventory.parquet \
--processes 4
# 3) Build TACO from formatted files
python ocean_taco/generate_dataset/build_taco.py \
--data-dir ./formatted_ssh_data \
--output-dir ./tortilla \
--inventory-path ./formatted_ssh_data/file_inventory.parquet
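If you prefer to drive the three commands from Python, a minimal sketch could assemble the argument lists as shown here. It uses only the flags that appear in the commands above; the `build_commands` helper itself is hypothetical, and the snippet prints the commands instead of running them.

```python
import shlex
import sys

SCRIPT_DIR = "ocean_taco/generate_dataset"


def build_commands(start, end, raw_dir, formatted_dir, taco_dir,
                   inventory_name="file_inventory.parquet", processes=4):
    """Return the argv lists for download, format, and build, in that order."""
    download = [
        sys.executable, f"{SCRIPT_DIR}/download.py",
        "--start-date", start,
        "--end-date", end,
        "--output-dir", raw_dir,
        "--download",
    ]
    fmt = [
        sys.executable, f"{SCRIPT_DIR}/format.py",
        "--date-min", start,
        "--date-max", end,
        "--data-dir", raw_dir,
        "--output-dir", formatted_dir,
        "--inventory-path", inventory_name,
        "--processes", str(processes),
    ]
    build = [
        sys.executable, f"{SCRIPT_DIR}/build_taco.py",
        "--data-dir", formatted_dir,
        "--output-dir", taco_dir,
        # format.py writes the inventory under its own output directory.
        "--inventory-path", f"{formatted_dir}/{inventory_name}",
    ]
    return [download, fmt, build]


commands = build_commands(
    "2024-01-01", "2024-01-04",
    "./ssh_state_data", "./formatted_ssh_data", "./tortilla",
)
for command in commands:
    # Print only; pass each list to subprocess.run(command, check=True) to execute.
    print(shlex.join(command))
```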
Step 1: Download Raw Data (download.py)
Script: ocean_taco/generate_dataset/download.py
Example:
python ocean_taco/generate_dataset/download.py \
--start-date 2024-01-01 \
--end-date 2024-01-31 \
--output-dir ./ssh_state_data \
--download
CLI Options
| Argument | Type | Default | Purpose |
|---|---|---|---|
| `--start-date` | str (date) | | Start date (inclusive). |
| `--end-date` | str (date) | | End date (inclusive). |
| `--output-dir` | path | | Root directory for downloaded raw files. |
| | path | | Optional custom log/report directory. |
| `--dry-run` | flag | | Preview actions only. If neither mode flag is set, the script still behaves as a dry run. |
| `--download` | flag | | Perform real downloads. |
| `--weekly-batches` | flag | | Split the date range into weekly segments. |
| | str | | AVISO FTP username (for SWOT FTP flows). |
| | str | | AVISO FTP password (for SWOT FTP flows). |
| | choice: `l2` or `l3` | | |
| `--continue-on-error` | flag | | Continue with remaining datasets if one fails. |
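To picture what splitting a date range into weekly segments means, here is a minimal sketch. It assumes inclusive start and end dates and segments of at most seven days; how download.py actually aligns its batches is not documented on this page, so treat `split_weekly` as a hypothetical stand-in.

```python
from datetime import date, timedelta


def split_weekly(start: date, end: date):
    """Split the inclusive range [start, end] into chunks of at most 7 days."""
    if end < start:
        raise ValueError("end must not be before start")
    segments = []
    current = start
    while current <= end:
        # Each segment covers up to 7 days, clipped to the overall end date.
        segment_end = min(current + timedelta(days=6), end)
        segments.append((current, segment_end))
        current = segment_end + timedelta(days=1)
    return segments


segments = split_weekly(date(2024, 1, 1), date(2024, 1, 31))
for seg_start, seg_end in segments:
    print(seg_start.isoformat(), "->", seg_end.isoformat())
```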
Important Current Behavior
In the current main() implementation, only L3 SSH is actively enabled in the download_functions list.
Other dataset download calls are present but currently commented out in download.py.
The script writes structured logs and a JSON report in the log directory.
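The dry-run default described in the options table can be sketched with argparse as follows. This is not the script's code: `parse_mode` is a hypothetical helper, and the handling of both flags being passed together (dry run wins) is an assumption, since this page does not say what download.py does in that case.

```python
import argparse


def parse_mode(argv):
    """Return "download" or "dry-run" for the given command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--download", action="store_true")
    args = parser.parse_args(argv)
    # Assumption: real downloads happen only when --download is given and
    # --dry-run is not; every other combination only previews.
    if args.download and not args.dry_run:
        return "download"
    return "dry-run"


print(parse_mode([]))
print(parse_mode(["--download"]))
```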
Step 2: Format and Regionalize Data (format.py)
Script: ocean_taco/generate_dataset/format.py
This step:
reads raw source files
splits/aggregates into 8 spatial regions
writes formatted regional NetCDF files
writes/updates file_inventory.parquet
Example:
python ocean_taco/generate_dataset/format.py \
--date-min 2024-01-01 \
--date-max 2024-01-31 \
--data-dir ./ssh_state_data \
--output-dir ./formatted_ssh_data \
--inventory-path file_inventory.parquet \
--processes 4
CLI Options
| Argument | Type | Default | Purpose |
|---|---|---|---|
| `--date-min` | str (date) | | Start date (inclusive). |
| `--date-max` | str (date) | | End date (inclusive). |
| `--data-dir` | path | | Root directory containing raw downloaded data. |
| `--output-dir` | path | | Root directory for regional formatted outputs. |
| `--inventory-path` | filename/path | | Inventory filename (written under the output dir). |
| `--processes` | int | | Number of worker processes. |
| | bool toggle | | Include or skip L3 SWOT processing. |
| | bool toggle | | Include or skip L3 SSH processing. |
| | bool toggle | | Include or skip Argo processing. |
| | list of strings | | Restrict processing to specific sources. |
| `--update-existing-inventory` | flag | | Merge into the existing inventory instead of replacing it. |
Output You Should Expect
Regional files under modality-specific directories in --output-dir
Inventory parquet at: <output-dir>/<inventory-path>
By default that is: ./formatted_ssh_data/file_inventory.parquet
Step 3: Build TACO (build_taco.py)
Script: ocean_taco/generate_dataset/build_taco.py
This step:
loads formatted inventory parquet
groups files into Date -> Region -> Files hierarchy
writes final TACO output
optionally verifies readability
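The Date -> Region -> Files grouping listed above can be pictured as a nested mapping. The following sketch works on plain dictionaries; the field names `date`, `region`, and `path` and the sample values are hypothetical and may not match the real inventory columns.

```python
from collections import defaultdict


def group_by_date_and_region(records):
    """Nest inventory records as {date: {region: [path, ...]}}."""
    tree = defaultdict(lambda: defaultdict(list))
    for record in records:
        tree[record["date"]][record["region"]].append(record["path"])
    # Convert to plain dicts so the result prints and compares cleanly.
    return {day: dict(regions) for day, regions in tree.items()}


# Hypothetical inventory rows for illustration only.
records = [
    {"date": "2024-01-01", "region": "region_a", "path": "l3_ssh/a_0101.nc"},
    {"date": "2024-01-01", "region": "region_a", "path": "argo/a_0101.nc"},
    {"date": "2024-01-01", "region": "region_b", "path": "l3_ssh/b_0101.nc"},
    {"date": "2024-01-02", "region": "region_a", "path": "l3_ssh/a_0102.nc"},
]
hierarchy = group_by_date_and_region(records)
print(hierarchy)
```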
Example:
python ocean_taco/generate_dataset/build_taco.py \
--data-dir ./formatted_ssh_data \
--output-dir ./tortilla \
--inventory-path ./formatted_ssh_data/file_inventory.parquet
CLI Options
| Argument | Type | Default | Purpose |
|---|---|---|---|
| `--data-dir` | path | | Root directory with formatted regional files. |
| `--output-dir` | path | | Output directory for built OceanTACO artifacts. |
| `--inventory-path` | path | required | Input inventory parquet path. |
| | bool toggle | | Include/skip L3 SWOT during build. |
| | bool toggle | | Include/skip Argo during build. |
| | flag | | Verify the built TACO by loading it. |
| | str (date) | | Optional inventory filter lower bound. |
| | str (date) | | Optional inventory filter upper bound. |
| `--analyze-duplicates-only` | flag | | Run duplicate analysis and exit without build. |
| `--duplicate-report-path` | path | | Optional parquet output for duplicate-analysis details. |
Recommended Workflow Patterns
1. Safe first pass (dry run + small date range)
python ocean_taco/generate_dataset/download.py --start-date 2024-01-01 --end-date 2024-01-03 --dry-run
python ocean_taco/generate_dataset/format.py --date-min 2024-01-01 --date-max 2024-01-03 --processes 2
python ocean_taco/generate_dataset/build_taco.py --inventory-path ./formatted_ssh_data/file_inventory.parquet
2. Incremental updates
If you rerun formatting for a new date window and want to keep old inventory entries:
python ocean_taco/generate_dataset/format.py \
--date-min 2024-02-01 \
--date-max 2024-02-07 \
--update-existing-inventory
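Conceptually, merging into an existing inventory means keeping old rows and adding the new ones. The sketch below shows one possible policy, where a new row replaces an old row with the same key. The key field `path` and the replace-on-conflict rule are assumptions; the actual behavior of --update-existing-inventory is not specified on this page.

```python
def merge_inventories(existing, new, key="path"):
    """Keep existing rows, replacing any whose key reappears in the new rows."""
    merged = {row[key]: row for row in existing}
    # Rows from the new run win when the same key appears in both inventories.
    merged.update({row[key]: row for row in new})
    return list(merged.values())


# Hypothetical rows for illustration only.
old_rows = [
    {"path": "l3_ssh/a_0101.nc", "date": "2024-01-01"},
    {"path": "l3_ssh/a_0201.nc", "date": "2024-02-01", "note": "old"},
]
new_rows = [
    {"path": "l3_ssh/a_0201.nc", "date": "2024-02-01", "note": "new"},
    {"path": "l3_ssh/a_0202.nc", "date": "2024-02-02"},
]
merged_rows = merge_inventories(old_rows, new_rows)
print([row["path"] for row in merged_rows])
```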
3. Duplicate investigation before build
python ocean_taco/generate_dataset/build_taco.py \
--inventory-path ./formatted_ssh_data/file_inventory.parquet \
--analyze-duplicates-only \
--duplicate-report-path ./reports/duplicates.parquet
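As a rough picture of what a duplicate analysis looks for, the sketch below flags inventory rows that share the same key. The definition of a duplicate used here (same `date`, `region`, and `source`) and all field names are assumptions made for illustration; build_taco.py may use different criteria.

```python
from collections import defaultdict


def find_duplicates(records, key_fields=("date", "region", "source")):
    """Return {key: [paths]} for every key that occurs more than once."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        groups[key].append(record["path"])
    return {key: paths for key, paths in groups.items() if len(paths) > 1}


# Hypothetical rows for illustration only.
rows = [
    {"date": "2024-01-01", "region": "region_a", "source": "l3_ssh", "path": "x.nc"},
    {"date": "2024-01-01", "region": "region_a", "source": "l3_ssh", "path": "y.nc"},
    {"date": "2024-01-01", "region": "region_b", "source": "l3_ssh", "path": "z.nc"},
]
duplicates = find_duplicates(rows)
print(duplicates)
```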
Troubleshooting
If the build fails, first verify that --inventory-path exists and points to the inventory written by the format step.
If the output looks incomplete, confirm which data sources are currently enabled in the download_functions list in download.py.
For long date ranges, start with --weekly-batches in download and a moderate --processes value in format.
Use --continue-on-error in download only when partial results are acceptable.