Published January 3, 2025 | Version v1
Dataset Open

Mobility Networked Time Series Benchmark Datasets

Description

Overview

Human mobility is crucial for urban planning (e.g., public transportation) and epidemic response strategies. However, existing research often neglects integrating comprehensive perspectives on spatial dynamics, temporal trends, and other contextual views due to the limitations of existing mobility datasets. To bridge this gap, we introduce MOBINS (MOBIlity Networked time Series), a novel dataset collection designed for networked time-series forecasting of dynamic human movements. MOBINS features diverse and explainable datasets that capture various mobility patterns across different transportation modes in four cities and two countries and cover both transportation and epidemic domains at the administrative area level. Our experiments with nine baseline methods reveal the significant impact of different model backbones on the proposed six datasets. We provide a valuable resource for advancing urban mobility research, and our dataset collection is available at DOI 10.5281/zenodo.14590709.

 

Benchmark Code

Go to GitHub: https://github.com/kaist-dmlab/MOBINS

 

Benchmark Baseline List

  • Linear-based: DLinear, NLinear
  • RNN-based: SegRNN
  • Transformer-based: Informer, Reformer, PatchTST
  • CNN-based: TimesNet
  • GNN-based: STGCN, MPNNLSTM

Detailed Benchmark Results

MOBINS_Results.pdf in the GitHub repository reports the detailed benchmark results of MOBINS in terms of MAE and MSE, along with standard deviations.

 

Code Licence

  • Our code implementation is released under the MIT License

Code Reference

  • DLinear: https://github.com/cure-lab/LTSF-Linear
  • NLinear: https://github.com/cure-lab/LTSF-Linear
  • SegRNN: https://github.com/lss-1138/SegRNN
  • Informer: https://github.com/zhouhaoyi/Informer2020
  • Reformer: https://github.com/lucidrains/reformer-pytorch
  • PatchTST: https://github.com/yuqinie98/PatchTST
  • TimesNet: https://github.com/thuml/TimesNet
  • STGCN: https://github.com/hazdzz/STGCN
  • MPNNLSTM: https://github.com/geopanag/pandemic_tgnn
 

Benchmark Datasets

Dataset Descriptions

Domain | Location | Nodes | Edges | Spatial node unit | Daily movements | Daily amounts | Time interval | Time range | Frames | Target dimension
Transportation | Seoul | 128 | 290 | Station-based administrative area | SmartCard: 2.68M | In/Out-flow: 4.02M | 1 hour | 01/01/2022-12/31/2023 | 17520 | 16640
Transportation | Busan | 60 | 121 | Station-based administrative area | SmartCard: 0.63M | In/Out-flow: 0.75M | 1 hour | 01/01/2021-12/31/2023 | 26280 | 3720
Transportation | Daegu | 61 | 123 | Station-based administrative area | SmartCard: 0.10M | In/Out-flow: 0.34M | 1 hour | 01/01/2021-12/31/2023 | 26280 | 3843
Transportation | NYC | 5 | 12 | Borough | Taxi: 0.10M | Ridership: 3.03M | 1 hour | 02/01/2022-03/31/2024 | 17280 | 30
Epidemic | Korea | 16 | 45 | City&Province | SmartCard: 13.41M | Infection: 25834 | 1 day | 01/20/2020-08/31/2023 | 1320 | 272
Epidemic | NYC | 5 | 12 | Borough | Taxi: 2418 | Infection: 2038 | 1 day | 03/01/2020-12/31/2023 | 1401 | 30
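
The reported target dimensions are consistent with counting one target per OD pair plus one per node variable, that is n * n + n * d. The following short check illustrates this reading; the formula is an observation drawn from the table and from the variable lists in the format description, not a definition stated by the authors.

```python
# (n nodes, d node variables, reported target dimension) per dataset.
# n and the reported values come from the table; d is taken from the
# VARIABLE_NAME list (INFLOW and OUTFLOW -> 2, RIDERSHIP or INFECTION -> 1).
DATASETS = {
    "Transportation-Seoul": (128, 2, 16640),
    "Transportation-Busan": (60, 2, 3720),
    "Transportation-Daegu": (61, 2, 3843),
    "Transportation-NYC": (5, 1, 30),
    "Epidemic-Korea": (16, 1, 272),
    "Epidemic-NYC": (5, 1, 30),
}


def target_dimension(n_nodes: int, n_vars: int) -> int:
    """Assumed formula: n*n OD-movement targets plus n*d node-feature targets."""
    return n_nodes * n_nodes + n_nodes * n_vars


for name, (n, d, reported) in DATASETS.items():
    status = "matches" if target_dimension(n, d) == reported else "differs"
    print(f"{name}: computed {target_dimension(n, d)}, reported {reported} ({status})")
```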

 

Formats of datasets (MOBINS.zip)

  • CSV-format datasets, usable in any environment: each dataset has three components.
    • SPATIAL_NETWORK.csv: ( n∗n where n = # of nodes )
      • Column name list: INDEX, N0, N1, …, Nn
      • INDEX list: N0, N1, …, Nn
    • NODE_TIME_SERIES_FEATURES.csv: ( t * p ) * ( n * d ) where t = # of timestamps in a day, p = total period, and d = # of variables from time series
      • Column name list: datetime, N0_{VARIABLE_NAME}, N1_{VARIABLE_NAME}, …, Nn_{VARIABLE_NAME}
      • VARIABLE_NAME list: Transportation-[Seoul, Busan, Daegu] datasets (INFLOW, OUTFLOW), Transportation-NYC dataset (RIDERSHIP), Epidemic-[Korea, NYC] datasets (INFECTION)
    • OD_MOVEMENTS.csv: ( t * p ) * ( n * n )
      • Column name list: N0_N0, N0_N1, N0_N2, …, Nn_Nn−1, Nn_Nn
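
As a minimal sketch of how the three components might be read with pandas: the function name `load_mobins_dataset` and the assumption that the three CSV files sit together in one directory per dataset are illustrative, since the folder layout inside MOBINS.zip is not described here. The column list for OD_MOVEMENTS.csv does not mention a datetime column, so the file is read as-is.

```python
from pathlib import Path

import pandas as pd


def load_mobins_dataset(dataset_dir):
    """Read the three CSV components of one dataset (hypothetical helper).

    Assumes dataset_dir contains SPATIAL_NETWORK.csv,
    NODE_TIME_SERIES_FEATURES.csv, and OD_MOVEMENTS.csv.
    """
    dataset_dir = Path(dataset_dir)

    # n x n spatial network, indexed by the INDEX column (N0, N1, ...).
    spatial = pd.read_csv(dataset_dir / "SPATIAL_NETWORK.csv", index_col="INDEX")

    # One row per timestamp; columns follow the N{i}_{VARIABLE_NAME} pattern.
    features = pd.read_csv(
        dataset_dir / "NODE_TIME_SERIES_FEATURES.csv",
        index_col="datetime",
        parse_dates=True,
    )

    # One row per timestamp; one column per origin-destination pair.
    od = pd.read_csv(dataset_dir / "OD_MOVEMENTS.csv")

    return spatial, features, od
```

A quick sanity check after loading is to compare `od.shape[1]` with `spatial.shape[0] ** 2`, which should agree if the OD file holds exactly one column per node pair.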

 

Meta datasets 

The metadata is documented in MOBINS_Meta.pdf, available in the GitHub repository.

Metadata for Transportation Datasets

Each file contains information about a single node or a node pair, which is abstracted for simplicity by describing only the i-th node. We omit the detailed description in metadata for Transportation-[Busan, Daegu] because the CSV file structures are identical to the metadata for Transportation_Seoul, differing only in the number of nodes, which is unique to each dataset. Transportation_NYC follows a similar structure, with the exception of the variable for node time-series features (ridership).

 

Metadata for Epidemic Datasets

Each file contains information about a single node or a node pair, which is abstracted for simplicity by describing only the i-th node. Both datasets share a consistent structure in terms of node time-series features, OD movements, and spatial networks.

 

Data Licence

 

How to Curate MOBINS 

Composition

The MOBINS dataset collection consists of mobility networked time-series data for forecasting tasks in two domains: Transportation-[Seoul, Busan, Daegu, NYC] and Epidemic-[Korea, NYC]. Each dataset comprises three key components: (1) OD movements, (2) a spatial network, and (3) time series. These datasets capture the temporal evolution of OD movements and time series within a fixed spatial network. OD movements represent the volume of movements between pairs of nodes, while time series denotes the time-varying features within each node. These datasets provide a comprehensive understanding of mobility patterns, exhibiting high correlation and synergy between OD movements and time series.

Collection Process

All datasets in MOBINS are collected from reliable sources, including government agencies, local governments, public transportation operators, and smart card companies. These sources provide publicly accessible data downloads based on their administrative systems. The source data from smart transit card information systems is accessed through API calls at the administrative area level, such as neighborhoods or provinces, to align the spatial resolution of the time series.
The use of data available on the Korea Public Data Portal is either unrestricted or covered by the CC BY license. For sources without a specific license indication, we confirmed through phone or email inquiries that the data may be used for research. Additionally, data from the Korea Disease Control and Prevention Agency was used without numerical value modifications after obtaining permission.

Preprocessing/Cleaning/Labeling

Each dataset in the MOBINS collection is derived from different sources for OD movements and time series. To ensure consistent spatial and temporal resolution, we align these two sources using Python. In the Transportation-[Seoul, Busan, Daegu] datasets, we use 'station-based administrative areas' as spatial node units, treating stations within the same administrative area as a single node. For the Transportation-NYC dataset, we use boroughs as spatial node units to align the spatial resolution between taxi zones and stations. In the Epidemic-Korea dataset, the source infection case data is collected at the city and province levels. Hence, we use OD movements based on the city and province levels to match spatial resolution. Similarly, for the Epidemic-NYC dataset, we use corresponding OD movements at the borough level to maintain consistent spatial node units. After the spatial resolutions are determined, we generate the spatial network based on these resolutions.
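
To illustrate what treating stations within the same administrative area as a single node could look like, here is a minimal pandas sketch. It assumes station-level counts are combined by summation and that a station-to-area lookup is available; the function name, the station IDs, and the area names are hypothetical and do not reflect the authors' actual pipeline.

```python
import pandas as pd


def aggregate_stations_to_nodes(station_flows, station_to_area):
    """Sum station-level columns into one column per administrative area.

    station_flows: DataFrame with one column per station (rows = timestamps).
    station_to_area: dict mapping station column name -> area name (assumed).
    """
    # Transpose so stations become the index, group them by area, sum,
    # and transpose back so that rows are timestamps again.
    return station_flows.T.groupby(station_to_area).sum().T


# Hypothetical example: three stations, two of them in the same area.
flows = pd.DataFrame(
    {"S1": [1, 2], "S2": [10, 20], "S3": [100, 200]},
    index=pd.date_range("2022-01-01", periods=2, freq="60min"),
)
nodes = aggregate_stations_to_nodes(flows, {"S1": "A", "S2": "A", "S3": "B"})
print(nodes)
```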

Regarding the temporal aspect, although the source OD movements for Transportation-[Busan, Daegu, NYC] are recorded at intervals shorter than 15 minutes, we set the frequency to 1 hour in MOBINS to match the frequency of the time-series data. Integrating two sources that are positively or negatively correlated enables the interpretation and forecasting of data from various contextual perspectives.
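
The down-sampling to a 1-hour frequency can be sketched with pandas resampling. This example assumes the OD counts are held in a DataFrame with a datetime index and that sub-hourly counts are summed within each hour; the function name `to_hourly`, the 15-minute spacing of the toy data, and the column name are illustrative only.

```python
import pandas as pd


def to_hourly(od_counts):
    """Sum sub-hourly OD counts into one-hour bins (assumed aggregation)."""
    return od_counts.resample("60min").sum()


# Toy data: five 15-minute records for a single OD pair.
index = pd.date_range("2022-02-01 00:00", periods=5, freq="15min")
od_15min = pd.DataFrame({"N0_N1": [1, 1, 1, 1, 1]}, index=index)

hourly = to_hourly(od_15min)
print(hourly)
```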

Among our dataset collection, the source OD movements of the Transportation-Seoul dataset have 14 missing days (07/01/2022 -- 07/06/2022, 07/13/2022, 07/20/2022, 08/06/2022, 08/07/2022, 09/13/2022, 10/31/2022, 11/01/2022, and 12/04/2022) in the Korea Public Data Portal. These missing days are filled with additional OD movement information from the smart transit card information system. Meanwhile, source OD movements from the NYC taxi dataset contain abnormal taxi records. To provide clean NYC OD movements, we remove abnormal taxi records if the difference between drop-off and pick-up timestamps is less than 0 seconds or more than 6 hours for each record. To facilitate future data updates, we maintain backups of the raw source data.
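
The taxi-record rule described above (drop a record when drop-off minus pick-up is below 0 seconds or above 6 hours) can be sketched as follows. The function name and the default column names `pickup_datetime` and `dropoff_datetime` are assumptions for illustration; the raw source files may use different column names.

```python
import pandas as pd

MAX_TRIP = pd.Timedelta(hours=6)


def drop_abnormal_trips(
    trips, pickup_col="pickup_datetime", dropoff_col="dropoff_datetime"
):
    """Keep only trips whose duration lies between 0 seconds and 6 hours."""
    duration = trips[dropoff_col] - trips[pickup_col]
    # Records with negative durations or durations above 6 hours are removed.
    keep = (duration >= pd.Timedelta(0)) & (duration <= MAX_TRIP)
    return trips[keep]


# Toy records: negative duration, zero duration, exactly 6 hours, just over 6 hours.
trips = pd.DataFrame(
    {
        "pickup_datetime": pd.to_datetime(
            ["2022-02-01 10:00:00"] * 4
        ),
        "dropoff_datetime": pd.to_datetime(
            [
                "2022-02-01 09:59:00",
                "2022-02-01 10:00:00",
                "2022-02-01 16:00:00",
                "2022-02-01 16:00:01",
            ]
        ),
    }
)
clean = drop_abnormal_trips(trips)
print(len(trips), "->", len(clean))
```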

 

Data Reference

[note] All source websites provide an official English version except the Smart Transit Card Information System and the Korea Disease Control and Prevention Agency. Therefore, we describe how to contact or use these two sources.

 

Code Reference

We implemented our benchmark code based on the Time Series Library (TSLib).

Citation

@inproceedings{na2025mobility,
  title={Mobility Networked Time Series Benchmark Datasets},
  author={Na, Jihye and Nam, Youngeun and Yoon, Susik and Song, Hwanjun and Lee, Byung Suk and Lee, Jae-Gil},
  booktitle={ICWSM},
  year={2025},
}
 

Acknowledgement

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (Ministry of Science and ICT) (No. 2023R1A2C2003690).

Files

MOBINS.zip (253.7 MB)
md5:605bb95e35029299a18569941cc2b822