ANTAQ Ship-Turnaround Dataset and Reproduction Artifacts

Caruso Barbosa Pacheco, Eduardo

doi:10.5281/zenodo.20549767

Published June 5, 2026 | Version v2

Dataset Open

Caruso Barbosa Pacheco, Eduardo (Researcher)^{1, 2}

1. Lawgorithm
2. Universidade de São Paulo

# ANTAQ Ship-Turnaround Dataset & Reproduction Artifacts

**Data and intermediate artifacts for the paper:**

> E. C. B. Pacheco, R. P. Martins, A. dos Santos Gualberto, J. V. Souza Germano,
> M. Miranda Neto, and F. Louzada, *"Methodological Pitfalls in Predicting Ship
> Turnaround Time at Brazilian Ports: An Empirical Audit and a Reproducible
> Pipeline,"* IEEE Transactions on Intelligent Transportation Systems, 2026.

This Zenodo record hosts the **data and result artifacts** that are too large
for GitHub. The **code** lives in the companion repository:

- Code: <https://github.com/eduardocbpacheco/ANTAQ>
- Data DOI: **10.5281/zenodo.20549161**

---

## What is in this archive

`antaq-turnaround-data-v1.2.zip` unzips to a single `data/` folder designed to
drop directly into the root of the code repository (`Codigo Reproducao/`):

```
data/
├── raw/               (~6.0 GB)  raw ANTAQ tables, one folder per year
│   ├── 2018/ … 2024/             {ano}Atracacao.txt, {ano}Carga.txt, …
│   └── categories/               Mercadoria.txt, Instalacao_*.txt, …
├── critic_datasets/   (~90 MB)   ORIGINAL authors' published datasets (Abreu/Rao)
│   ├── CargasBR2018EDA-LOGII.csv         153,331 × 35: their "EDA" dataset
│   └── EN-CargasBR2018Modelo-LOGII.csv   150,669 × 21: their "cleaned model" dataset
├── processed/         (~167 MB)  df_{ano}.parquet: cargo-level merge (step 1)
├── embeddings/        (~1.7 GB)  384-d sentence embeddings per year (step 2)
├── processed_agg/     (~1.6 GB)  df_{ano}_agg.parquet: berthing level (step 3)
└── output/            (~1.1 GB)  final matrices + hyperparameters + results
    ├── train.parquet, test.parquet    ~391k / ~98k berthings × 1613 cols
    ├── encoder.pkl, imputer.pkl, feature_sets.json    (M1=838 … M4=1606 features)
    ├── best_params_target_*.json
    ├── piso/   (Floor / ceteris paribus)    hyperparams/ + results_piso.csv + stats_piso.json
    └── teto/   (Ceiling / per-cell tuned)   hyperparams/ + results_teto.csv + stats_teto.json
```

Total uncompressed: ~11 GB.
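
After unzipping, a quick layout check can catch a partial extraction before any pipeline step runs. The following is a minimal sketch, not part of the repository: the function name `missing_dirs` is hypothetical, and the folder names are taken from the archive tree above.

```python
from pathlib import Path

# Top-level folders listed in the archive tree above.
EXPECTED_DIRS = [
    "raw",
    "critic_datasets",
    "processed",
    "embeddings",
    "processed_agg",
    "output",
]


def missing_dirs(data_root):
    """Return the expected sub-folders that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the archive was unzipped into the current directory.
    absent = missing_dirs("data")
    print("layout OK" if not absent else f"missing: {absent}")
```

An empty list means all six top-level folders are present; it says nothing about whether the files inside them are complete.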

## How to use it

1. Clone the code repository:
   ```bash
   git clone https://github.com/eduardocbpacheco/ANTAQ.git
   cd ANTAQ  # the "Codigo Reproducao" package
   ```
2. Download `antaq-turnaround-data-v1.2.zip` from this Zenodo record and unzip
   it inside that folder, so that `data/` sits next to `config.py`:
   ```bash
   unzip antaq-turnaround-data-v1.2.zip  # creates ./data/
   ```
3. Reproduce the paper's tables in seconds, or rebuild everything from raw.
   See `REPRODUCTION_GUIDE.md` (EN) / `GUIA_REPRODUCAO.md` (PT) /
   `reproduce.ipynb` in the repository.

`config.py` reads from `./data` by default; point it elsewhere with the
`ANTAQ_DATA_BASE` environment variable.
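
The override described above follows a common pattern: use the environment variable when it is set, otherwise fall back to `./data`. The sketch below illustrates that pattern only; `resolve_data_base` is a hypothetical helper, and the actual logic in `config.py` may differ.

```python
import os
from pathlib import Path


def resolve_data_base(env=None):
    """Return the data folder: ANTAQ_DATA_BASE if set and non-empty, else ./data.

    Hypothetical helper for illustration; not copied from config.py.
    """
    env = os.environ if env is None else env
    override = env.get("ANTAQ_DATA_BASE")
    return Path(override) if override else Path("data")


if __name__ == "__main__":
    print(resolve_data_base())
```

Passing a plain dictionary as `env` makes the behaviour easy to check without touching the real environment.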

## Provenance & licensing

- The raw ANTAQ tables (`data/raw/`) are public open data from the Brazilian
*Agência Nacional de Transportes Aquaviários* (ANTAQ),
<https://web3.antaq.gov.br/ea/sense/download.html>. They are redistributed
here unmodified, for archival and reproducibility, under ANTAQ's open-data
terms. All processed/derived artifacts are released under the same license as
the code repository.
- `data/critic_datasets/` contains the two datasets **published by the original
authors** (Abreu et al. 2023 / Rao et al. 2025) as supplementary material to
their papers. They are included so reviewers can independently reproduce the
empirical critique in `EDA_1_critique_of_original_papers.ipynb`. Credit and
rights for these two files belong to their original authors.

## Reference values (sanity check, Floor experiment, M4 vs M1)

| Target | M1 RMSE (h) | M4 vs M1 *d* | M4 vs M1 *p* |
|---|---|---|---|
| TEstadia | 76.21 ± 0.10 | +10.17 | < 10⁻⁷ |
| TEsperaAtracacao | 67.10 ± 0.07 | +17.27 | < 10⁻⁷ |
| TAtracado | 28.97 ± 0.06 | +4.69 | < 10⁻⁷ |
| TEsperaInicioOp | 18.41 ± 0.04 | −0.98 | 1.000 (n.s.) |
| TOperacao | 19.34 ± 0.02 | +5.62 | < 10⁻⁷ |
| TEsperaDesatracacao | 6.43 ± 0.02 | +0.07 | 0.159 (n.s.) |
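
To use the table as a sanity check, reproduced M1 RMSE values can be compared against the reference column. This is a minimal sketch under stated assumptions: the function `check_rmse` and the 0.5 h tolerance are illustrative choices, not values from the paper, and reading the reproduced numbers out of `results_piso.csv` is left out because its column layout is not described here.

```python
# M1 RMSE in hours, copied from the reference table above (Floor experiment).
REFERENCE_M1_RMSE = {
    "TEstadia": 76.21,
    "TEsperaAtracacao": 67.10,
    "TAtracado": 28.97,
    "TEsperaInicioOp": 18.41,
    "TOperacao": 19.34,
    "TEsperaDesatracacao": 6.43,
}


def check_rmse(reproduced, tolerance=0.5):
    """Return {target: (reference, reproduced)} for missing or out-of-tolerance targets.

    tolerance is in hours and is an assumed value chosen for illustration.
    """
    off = {}
    for target, ref in REFERENCE_M1_RMSE.items():
        got = reproduced.get(target)
        if got is None or abs(got - ref) > tolerance:
            off[target] = (ref, got)
    return off
```

An empty result means every target was found and lies within the chosen tolerance of its reference value.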

## How to cite

Please cite **both** the paper and this dataset.

```bibtex
@article{pacheco2026turnaround,
  author  = {Pacheco, Eduardo Caruso Barbosa and Martins, Reynaldo Pereira
             and Gualberto, Alexandre dos Santos and Germano, Jo\~{a}o Vitor Souza
             and Miranda Neto, Milton and Louzada, Francisco},
  title   = {Methodological Pitfalls in Predicting Ship Turnaround Time at
             Brazilian Ports: An Empirical Audit and a Reproducible Pipeline},
  journal = {IEEE Transactions on Intelligent Transportation Systems},
  year    = {2026},
  note    = {Code: https://github.com/eduardocbpacheco/ANTAQ}
}

@dataset{pacheco2026turnaround_data,
  author    = {Pacheco, Eduardo Caruso Barbosa and Martins, Reynaldo Pereira
               and Gualberto, Alexandre dos Santos and Germano, Jo\~{a}o Vitor Souza
               and Miranda Neto, Milton and Louzada, Francisco},
  title     = {{ANTAQ Ship-Turnaround Dataset and Reproduction Artifacts (v1.2)}},
  year      = {2026},
  publisher = {Zenodo},
  version   = {1.2},
  doi       = {10.5281/zenodo.20549161}
}
```

Files (5.5 GB)

| Name | MD5 | Size |
|---|---|---|
| antaq-turnaround-data-v1.2.zip | 3bff034899e37586f0693e19479a0681 | 5.5 GB |
| README_ZENODO.md | 792df71e475fc73eafaa183b58f3ab93 | 5.5 kB |
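
Because the archive is 5.5 GB, it is worth confirming the download against the MD5 listed above before unzipping. The sketch below streams the file in chunks so it does not need to fit in memory; `md5_of` is a hypothetical helper name, and the local file path is an assumption.

```python
import hashlib

# MD5 of antaq-turnaround-data-v1.2.zip, as listed in the file table above.
EXPECTED_MD5 = "3bff034899e37586f0693e19479a0681"


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading chunk_size bytes at a time."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumes the archive was saved in the current directory.
    actual = md5_of("antaq-turnaround-data-v1.2.zip")
    print("checksum OK" if actual == EXPECTED_MD5 else f"mismatch: {actual}")
```

A mismatch usually indicates an interrupted or corrupted download, in which case the file should be fetched again.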

Additional details

Repository URL: https://github.com/eduardocbpacheco/ANTAQ
Programming language: Python
Development Status: Active
