Published June 5, 2026 | Version v2

ANTAQ Ship-Turnaround Dataset and Reproduction Artifacts

  • 1. Lawgorithm
  • 2. Universidade de São Paulo

Description

# ANTAQ Ship-Turnaround Dataset & Reproduction Artifacts 

**Data and intermediate artifacts for the paper:**

> E. C. B. Pacheco, R. P. Martins, A. dos Santos Gualberto, J. V. Souza Germano,
> M. Miranda Neto, and F. Louzada, *"Methodological Pitfalls in Predicting Ship
> Turnaround Time at Brazilian Ports: An Empirical Audit and a Reproducible
> Pipeline,"* IEEE Transactions on Intelligent Transportation Systems, 2026.

This Zenodo record hosts the **data and result artifacts** that are too large
for GitHub. The **code** lives in the companion repository:

- Code: <https://github.com/eduardocbpacheco/ANTAQ>
- Data DOI: **10.5281/zenodo.20549161**

---

## What is in this archive

`antaq-turnaround-data-v1.2.zip` unzips to a single `data/` folder designed to
drop directly into the root of the code repository (`Codigo Reproducao/`):

```
data/
├── raw/                      (~6.0 GB)  raw ANTAQ tables, one folder per year
│   ├── 2018/ … 2024/                    {ano}Atracacao.txt, {ano}Carga.txt, …
│   └── categories/                       Mercadoria.txt, Instalacao_*.txt, …
├── critic_datasets/          (~90 MB)   ORIGINAL authors' published datasets (Abreu/Rao)
│   ├── CargasBR2018EDA-LOGII.csv          153,331 × 35 — their "EDA" dataset
│   └── EN-CargasBR2018Modelo-LOGII.csv    150,669 × 21 — their "cleaned model" dataset
├── processed/                (~167 MB)  df_{ano}.parquet — cargo-level merge (step 1)
├── embeddings/               (~1.7 GB)  384-d sentence embeddings per year (step 2)
├── processed_agg/            (~1.6 GB)  df_{ano}_agg.parquet — berthing level (step 3)
└── output/                   (~1.1 GB)  final matrices + hyperparameters + results
    ├── train.parquet, test.parquet      ~391k / ~98k berthings × 1613 cols
    ├── encoder.pkl, imputer.pkl, feature_sets.json   (M1=838 … M4=1606 features)
    ├── best_params_target_*.json
    ├── piso/  (Floor / ceteris paribus)  hyperparams/ + results_piso.csv + stats_piso.json
    └── teto/  (Ceiling / per-cell tuned)  hyperparams/ + results_teto.csv + stats_teto.json
```

Total uncompressed: ~11 GB.
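
After unzipping, a quick check that the top-level folders from the layout above are in place can catch a partial or misplaced extraction. The sketch below is illustrative only and is not part of the companion repository; the helper name `missing_dirs` is hypothetical, and it checks only the six folder names shown in the tree, not their contents.

```python
from pathlib import Path

# Top-level folders listed in the archive layout above.
EXPECTED_DIRS = [
    "raw",
    "critic_datasets",
    "processed",
    "embeddings",
    "processed_agg",
    "output",
]


def missing_dirs(data_root):
    """Return the expected sub-folders that are absent under data_root."""
    root = Path(data_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    # Assumes the script is run from the repository root, next to data/.
    missing = missing_dirs("data")
    print("all folders present" if not missing else f"missing: {missing}")
```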

## How to use it

1. Clone the code repository:
   ```bash
   git clone https://github.com/eduardocbpacheco/ANTAQ.git
   cd ANTAQ            # the "Codigo Reproducao" package
   ```
2. Download `antaq-turnaround-data-v1.2.zip` from this Zenodo record and unzip
   it inside that folder, so that `data/` sits next to `config.py`:
   ```bash
   unzip antaq-turnaround-data-v1.2.zip      # creates ./data/
   ```
3. Reproduce the paper's tables in seconds, or rebuild everything from the raw tables.
   See `REPRODUCTION_GUIDE.md` (EN) / `GUIA_REPRODUCAO.md` (PT) /
   `reproduce.ipynb` in the repository.
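
Before unzipping in step 2, the download can be checked against the MD5 checksum shown in the file listing of this record. The sketch below is a generic integrity check, not a script shipped with the repository. It assumes that the checksum listed next to the 5.5 GB file belongs to `antaq-turnaround-data-v1.2.zip`, and the helper name `md5_of_file` is hypothetical.

```python
import hashlib
from pathlib import Path

# Checksum listed on this record for the 5.5 GB file
# (assumed here to be the data archive).
EXPECTED_MD5 = "3bff034899e37586f0693e19479a0681"


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    archive = Path("antaq-turnaround-data-v1.2.zip")
    if archive.is_file():
        ok = md5_of_file(archive) == EXPECTED_MD5
        print("checksum OK" if ok else "checksum MISMATCH, re-download")
    else:
        print(f"{archive} not found in the current folder")
```

Reading in chunks keeps memory use flat, which matters for a multi-gigabyte archive.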

`config.py` reads from `./data` by default; point it elsewhere with the
`ANTAQ_DATA_BASE` environment variable.
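
The override behaviour described above can be pictured with the following sketch. It is an assumption about how such a lookup typically works, not the actual contents of `config.py`; the function name `resolve_data_base` is hypothetical, and only the variable name `ANTAQ_DATA_BASE` and the `./data` default come from this record.

```python
import os
from pathlib import Path


def resolve_data_base(default="./data"):
    """Return ANTAQ_DATA_BASE if it is set and non-empty, else the default."""
    # Hypothetical helper: the real config.py may resolve the path differently.
    return Path(os.environ.get("ANTAQ_DATA_BASE") or default)


if __name__ == "__main__":
    print(f"data folder: {resolve_data_base()}")
```

With this pattern, exporting `ANTAQ_DATA_BASE` to the path of an external drive before launching the pipeline would let the roughly 11 GB of data live outside the repository checkout.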

## Provenance & licensing

- The raw ANTAQ tables (`data/raw/`) are public open data from the Brazilian
  *Agência Nacional de Transportes Aquaviários* (ANTAQ),
  <https://web3.antaq.gov.br/ea/sense/download.html>. They are redistributed
  here unmodified, for archival and reproducibility, under ANTAQ's open-data
  terms. All processed/derived artifacts are released under the same license as
  the code repository.
- `data/critic_datasets/` contains the two datasets **published by the original
  authors** (Abreu et al. 2023 / Rao et al. 2025) as supplementary material to
  their papers. They are included so reviewers can independently reproduce the
  empirical critique in `EDA_1_critique_of_original_papers.ipynb`. Credit and
  rights for these two files belong to their original authors.

## Reference values (sanity check, Floor experiment, M4 vs M1)

| Target | M1 RMSE (h) | M4 vs M1 *d* | M4 vs M1 *p* |
|---|---|---|---|
| TEstadia | 76.21 ± 0.10 | +10.17 | < 10⁻⁷ |
| TEsperaAtracacao | 67.10 ± 0.07 | +17.27 | < 10⁻⁷ |
| TAtracado | 28.97 ± 0.06 | +4.69 | < 10⁻⁷ |
| TEsperaInicioOp | 18.41 ± 0.04 | −0.98 | 1.000 (n.s.) |
| TOperacao | 19.34 ± 0.02 | +5.62 | < 10⁻⁷ |
| TEsperaDesatracacao | 6.43 ± 0.02 | +0.07 | 0.159 (n.s.) |
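
One way to use these values as a sanity check is to compare a reproduced M1 RMSE against the reference mean with a tolerance derived from the reported "±" figure. The sketch below makes two assumptions that the table does not confirm: that the "±" value is a spread such as a standard deviation across runs, and that three spreads is a reasonable tolerance. The function name `within_reference` is hypothetical.

```python
# M1 RMSE reference values in hours, copied from the table above: (mean, spread).
REFERENCE_M1_RMSE = {
    "TEstadia": (76.21, 0.10),
    "TEsperaAtracacao": (67.10, 0.07),
    "TAtracado": (28.97, 0.06),
    "TEsperaInicioOp": (18.41, 0.04),
    "TOperacao": (19.34, 0.02),
    "TEsperaDesatracacao": (6.43, 0.02),
}


def within_reference(target, rmse, n_spreads=3.0):
    """Check whether a reproduced RMSE lies within n_spreads of the reference.

    Assumption: the "±" figure is treated as a spread, and the tolerance
    multiplier is an arbitrary choice rather than a value from the paper.
    """
    mean, spread = REFERENCE_M1_RMSE[target]
    return abs(rmse - mean) <= n_spreads * spread


if __name__ == "__main__":
    # Illustrative input only, not a measured result.
    print(within_reference("TEstadia", 76.30))
```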

## How to cite

Please cite **both** the paper and this dataset.

```bibtex
@article{pacheco2026turnaround,
  author  = {Pacheco, Eduardo Caruso Barbosa and Martins, Reynaldo Pereira
             and Gualberto, Alexandre dos Santos and Germano, Jo\~{a}o Vitor Souza
             and Miranda Neto, Milton and Louzada, Francisco},
  title   = {Methodological Pitfalls in Predicting Ship Turnaround Time at
             Brazilian Ports: An Empirical Audit and a Reproducible Pipeline},
  journal = {IEEE Transactions on Intelligent Transportation Systems},
  year    = {2026},
  note    = {Code: https://github.com/eduardocbpacheco/ANTAQ}
}

@dataset{pacheco2026turnaround_data,
  author    = {Pacheco, Eduardo Caruso Barbosa and Martins, Reynaldo Pereira
               and Gualberto, Alexandre dos Santos and Germano, Jo\~{a}o Vitor Souza
               and Miranda Neto, Milton and Louzada, Francisco},
  title     = {{ANTAQ Ship-Turnaround Dataset and Reproduction Artifacts (v1.2)}},
  year      = {2026},
  publisher = {Zenodo},
  version   = {1.2},
  doi       = {10.5281/zenodo.20549161}
}
```

Files (5.5 GB)

antaq-turnaround-data-v1.2.zip

| Size | Checksum |
|---|---|
| 5.5 GB | md5:3bff034899e37586f0693e19479a0681 |
| 5.5 kB | md5:792df71e475fc73eafaa183b58f3ab93 |

Additional details

Software

- Repository URL: <https://github.com/eduardocbpacheco/ANTAQ>
- Programming language: Python
- Development status: Active