A Synthetic NCD based on Athens pilot cases
Authors/Creators
- 1. Performance Technologies, Data and Analytics Engineer
- 2. Performance Technologies, Head of Analytics
- 3. IDIAP Research Institute, Postdoctoral Researcher
- 4. IDIAP Research Institute, Senior Research Scientist
Description
The synthetic NCD dataset, based on Athens pilot cases, is a simulated yet realistic resource that can support the testing of AI and analytics algorithms for crime resolution.
The first version of the dataset (SAEX1) was generated to simulate terminal movements within a 2x2 km area over a four-hour window on July 28, 2023, from 08:00 to 12:00. It contains over 6.9 million signaling events across 32,485 terminals, an average of about 213 events per terminal. Each event logs a terminal’s location, a timestamp, and the terminal’s proximity to its serving cell, along with the cell coordinates. To align with the investigation of a crime scene, the dataset was filtered to include only those individuals whose movement patterns intersected a predefined bounding box around Kerameikos, Athens, Greece, during the specified timeframe. This filtering resulted in a refined dataset of trajectory data for 30,703 individuals. Figure 1 visualizes the final generated data for the SAEX1 dataset.
Figure 1: Initial SAEX Dataset.
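The bounding-box filtering described for SAEX1 can be sketched as follows. This is a minimal illustration and not the project's actual code: the `Event` fields, the function names, and the box coordinates are all assumptions, and "intersecting" the box is simplified to "having at least one event located inside it".

```python
from collections import namedtuple

# Hypothetical event layout; the real dataset's column names may differ.
Event = namedtuple("Event", ["terminal_id", "timestamp", "lat", "lon"])


def terminals_in_box(events, min_lat, max_lat, min_lon, max_lon):
    """Return the ids of terminals with at least one event inside the box."""
    return {
        e.terminal_id
        for e in events
        if min_lat <= e.lat <= max_lat and min_lon <= e.lon <= max_lon
    }


def filter_events(events, keep_ids):
    """Keep the full trajectory of every terminal that touched the box."""
    return [e for e in events if e.terminal_id in keep_ids]


events = [
    Event("t1", "2023-07-28 08:05:00", 37.9780, 23.7185),  # inside the box
    Event("t1", "2023-07-28 08:10:00", 37.9900, 23.7300),  # outside, same terminal
    Event("t2", "2023-07-28 08:07:00", 37.9950, 23.7400),  # never inside
]

# Placeholder coordinates, not the bounding box used for the dataset.
keep = terminals_in_box(events, 37.975, 37.981, 23.715, 23.722)
filtered = filter_events(events, keep)
```

Filtering by terminal rather than by event keeps whole trajectories, which matches the description of the result as trajectory data for the retained individuals.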
As part of the TRACY project, the implemented algorithms are being evaluated using real pilot cases, provided by LEA Partners. This evaluation revealed the need for an update to the initial version of the simulated data, as the cases included additional information beyond the cell data. Specifically, they also featured Call Detail Records (CDRs), which encompassed call logs, SMS messages and mobile data. To address this, the initial data was enriched with the following details:
- Provider Information: Identifying whether a terminal is a customer of one of the Greek telecommunications providers (Cosmote, Vodafone or Nova).
- Actions: Indicating whether a terminal made or received a call, sent or received an SMS or engaged in internet browsing activities.
To accomplish this, a custom CDR Simulator was developed. The simulator generates random pairs of terminals that engage in actions (calls or SMSs), along with the timeframes during which these actions occur. It also selects specific terminals to which browsing activities are assigned. The generated data is then enriched with information about the first and last serving cells, so that the result is a realistic synthetic case dataset.
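The pairing and timing steps of such a simulator might look like the sketch below. It is not the TRACY CDR Simulator itself: the record fields, the function names, and the duration limits are assumptions chosen for illustration, and the enrichment with first and last serving cells is left out.

```python
import random
from datetime import datetime, timedelta

# Simulation window taken from the dataset description.
START = datetime(2023, 7, 28, 8, 0, 0)
END = datetime(2023, 7, 28, 12, 0, 0)


def random_interval(rng, max_duration_s):
    """Pick a random [begin, end] interval that stays inside the window."""
    window_s = int((END - START).total_seconds())
    begin = START + timedelta(seconds=rng.randrange(window_s))
    end = begin + timedelta(seconds=rng.randint(0, max_duration_s))
    return begin, min(end, END)


def simulate_cdrs(terminal_ids, n_pair_actions, n_browsing, seed=0):
    """Generate hypothetical call/SMS records between random terminal pairs,
    plus browsing records for a random subset of terminals."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_pair_actions):
        source, target = rng.sample(terminal_ids, 2)  # two distinct terminals
        action = rng.choice(["CALL", "SMS"])
        # Assumption: calls last up to 10 minutes, SMSs are instantaneous.
        begin, end = random_interval(rng, 600 if action == "CALL" else 0)
        records.append({"action": action, "source": source,
                        "target": target, "begin": begin, "end": end})
    for terminal in rng.sample(terminal_ids, n_browsing):
        # Assumption: browsing sessions last up to 30 minutes.
        begin, end = random_interval(rng, 1800)
        records.append({"action": "BROWSING", "source": terminal,
                        "target": None, "begin": begin, "end": end})
    return records


terminals = ["t1", "t2", "t3", "t4", "t5"]
cdrs = simulate_cdrs(terminals, n_pair_actions=5, n_browsing=2, seed=1)
```

Seeding the random generator makes a run reproducible, which is convenient when the same synthetic case has to be regenerated for repeated evaluation.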
Finally, since the real case data is provided in multiple files per provider, the same structure has been adopted in the simulated dataset. It is important to note that each file contains cell information only for terminals that are customers of the respective provider. For all other terminals, the cell information is left blank. This approach reflects the structure and content of the real cases and enables the use of the same data pipeline process developed within the TRACY project.
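The per-provider layout can be illustrated with a short sketch. The field names and the `provider_of` mapping are assumptions, as is the choice to list every record in every file. For simplicity the sketch writes one CSV per provider, whereas the dataset ships two per provider, whose exact contents are not described here.

```python
import csv
import io

PROVIDERS = ["Cosmote", "Vodafone", "Nova"]
FIELDS = ["terminal_id", "action", "start", "cell_id"]  # hypothetical columns


def rows_for_provider(records, provider_of, provider):
    """Copy the records, blanking cell info for non-customers of `provider`."""
    rows = []
    for rec in records:
        row = dict(rec)
        if provider_of[rec["terminal_id"]] != provider:
            row["cell_id"] = ""  # cell info is only known to the home provider
        rows.append(row)
    return rows


def to_csv(rows):
    """Serialize rows to CSV text (in memory, to keep the sketch self-contained)."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


records = [
    {"terminal_id": "t1", "action": "CALL", "start": "2023-07-28 08:05:00", "cell_id": "c1"},
    {"terminal_id": "t2", "action": "SMS", "start": "2023-07-28 08:06:00", "cell_id": "c2"},
]
provider_of = {"t1": "Cosmote", "t2": "Nova"}

files = {p: to_csv(rows_for_provider(records, provider_of, p)) for p in PROVIDERS}
```

In this example the Cosmote file keeps the cell of `t1` and blanks that of `t2`, the Nova file does the opposite, and the Vodafone file has both cells blank.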
SAEX2 employs the same simulation settings as SAEX1. However, it involves a single criminal (ground truth) who follows a predetermined route within a specified timeframe, thereby enabling detection by the TRACY algorithm. Table 2 shows the dataset’s main characteristics.
| Metric | Value |
| --- | --- |
| Simulation Area Width x Height (km) | 2x2 |
| Simulation Start-time | 2023-07-28 08:00:00 |
| Simulation End-time | 2023-07-28 12:00:00 |
| Simulation Duration (minutes) | 240 |
| Number of Terminals | 30692 |
| Number of Events | 1469515 |
| Number of Produced CSVs | 6 (2 per provider) |
Table 2: SAEX2 Vital Statistics.
In the absence of real data for experimentation, having a synthetic yet realistic NCD dataset is crucial for developing and evaluating crime resolution techniques.
Open-source data was used during the development of this synthetic dataset, specifically data from OpenCellid, as well as open government data from the Hellenic Statistical Authority and the Antenna Construction Information Portal developed by the Hellenic Telecommunications and Post Commission (EETT).
TRACY is funded under DIGITAL-2022-DEPLOY-02-LAW-SECURITY-AI (GA: 101102641).