================================================================================
Exogenous Geological Process Activations in the Russian Federation (2007–2025)
================================================================================

PURPOSE
This file contains brief information about the dataset “Exogenous Geological Process Activations in the Russian Federation (2007–2025)”.

ARCHIVE CONTENTS
The archive contains the following files:
- egp_archive.csv – main data table (UTF-8, comma‑separated).
- egp_archive.xlsx – the same data in Microsoft Excel format.
- README_EN.txt – English version of the README.
- Codebook.pdf – detailed field descriptions and codebooks.

DATASET OVERVIEW
- DOI: 10.5281/zenodo.19787928
- Total records: 13 086
- Time span: Q1 2007 – Q4 2025
- Geographic coverage: 8 federal districts of the Russian Federation
- Records with precise coordinates (from 2019): 7 367 (≈56%)
- Records without coordinates (latitude and longitude filled with 0; suitable for aggregated analysis): 1 032 (≈8%)
- Additional geocoded records (pre‑2019): 4 687 (≈36%; accuracy categories: “EGP point”, “settlement/road”, or “district”)
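
A minimal sketch of separating the coordinate-free records when loading the table. It assumes the 0 placeholder is stored in both latitude and longitude; the inline sample rows are invented for illustration, and real code would read egp_archive.csv instead.

```python
import csv
import io

# Invented sample rows. For the real table, use:
# open("egp_archive.csv", encoding="utf-8", newline="")
SAMPLE = """id,latitude,longitude,year_start
1,55.75,37.62,2021
2,0,0,2010
3,43.58,39.72,2015
"""

def has_coordinates(row):
    """Treat a row as located unless both coordinates are the 0 placeholder."""
    lat = float(row["latitude"] or 0)
    lon = float(row["longitude"] or 0)
    return not (lat == 0.0 and lon == 0.0)

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
located = [r for r in rows if has_coordinates(r)]
unlocated = [r for r in rows if not has_coordinates(r)]
print(len(located), len(unlocated))
```

Filtering on the placeholder before any mapping step keeps the unlocated records from being plotted at latitude 0, longitude 0.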

DATA ORIGIN: PUBLIC REPORTS - FEDERAL AGENCY FOR MINERAL RESOURCES (ROSNEDRA)
The data are derived from the official quarterly information summaries of the
Federal Agency for Mineral Resources of the Russian Federation (Rosnedra).
The summaries are prepared by the Federal State Budgetary Institution
“Gidrospetsgeologiya”. The original reports are available at:
https://geomonitoring.ru/MEGP.php

AUTHORS: Kachur Daria, Derkacheva Anna - HSE University

GEOCODING METHODS (for records missing coordinates before 2019)
Three complementary methods were used to recover spatial coordinates:

1. Nominatim (OpenStreetMap)
  – Batch geocoding of address strings using the Nominatim service.
  – A Python script sent the requests with rate limiting (2‑second delays) after
    normalising each address (replacing abbreviations, stripping extra markers).
  – All obtained coordinates were verified by reverse geocoding: only those
    matching at least the region and district were kept.
  – Result: 1,099 verified coordinates.

2. SKDF Road Database
  – Applied to records that reference highways (e.g., “a/d … km”).
  – The reference cartographic road database (SKDF), provided by the research
    supervisor, was used. It contains linear road objects with attributes (region,
    city/district, road name).
  – A matching algorithm extracted road names and place names from the address,
    normalised them, and searched the index. If a match was found, the midpoint
    of the road segment was taken as coordinates.
  – Result: 7 verified coordinates.

3. YandexGPT (Yandex Cloud AI, model yandexgpt-5-lite/latest)
  – A detailed prompt was sent to the API, requesting a JSON response with
    latitude, longitude, region, district, and place name.
  – Temperature was set to 0.2 to ensure stable output.
  – All results were reverse‑geocoded for verification; only those with
    region/district consistency were retained.
  – Result: 3,581 verified coordinates.

All geocoded records were assigned an accuracy category (see below).
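
The rate-limited "geocode, then verify by reverse geocoding" pattern used in methods 1 and 3 can be sketched as follows. This is not the authors' script: `forward` and `reverse` are hypothetical callables standing in for the actual geocoder requests, the abbreviation table is a small illustrative sample, and the `district` key on each record is an assumption (the published table has no such field).

```python
import time

def normalise(address):
    """Expand a few abbreviations (illustrative list only) and collapse whitespace."""
    replacements = {"г. ": "город ", "р-н": "район"}
    for short, full in replacements.items():
        address = address.replace(short, full)
    return " ".join(address.split())

def geocode_verified(records, forward, reverse, delay=2.0, sleep=time.sleep):
    """Geocode each record; keep a result only if the reverse lookup
    returns the same region and district as the record itself."""
    kept = {}
    for rec in records:
        hit = forward(normalise(rec["location"]))
        sleep(delay)  # pause between requests to respect the service
        if hit is None:
            continue
        lat, lon = hit
        back = reverse(lat, lon)
        sleep(delay)
        if back and back["region"] == rec["region"] and back["district"] == rec["district"]:
            kept[rec["id"]] = (lat, lon)
    return kept

# Stub geocoders with invented data, so the sketch runs without network access.
def fake_forward(address):
    return {"город Сочи, улица Лесная": (43.60, 39.73)}.get(address)

def fake_reverse(lat, lon):
    return {"region": "Краснодарский край", "district": "Сочи"}

records = [
    {"id": 1, "location": "г. Сочи,  улица Лесная",
     "region": "Краснодарский край", "district": "Сочи"},
    {"id": 2, "location": "г. Сочи,  улица Лесная",
     "region": "Республика Адыгея", "district": "Майкоп"},
]
result = geocode_verified(records, fake_forward, fake_reverse, sleep=lambda s: None)
print(result)
```

Passing the geocoders and the sleep function as arguments keeps the verification logic testable offline; the second record is dropped because its stated region does not match the reverse lookup.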

FIELD DESCRIPTIONS
- id                                  – unique event identifier
- process_type                        – EGP type abbreviation (e.g., "Оп" – landslides,
                                        "Пт" – waterlogging, "Ка" – karst)
- latitude, longitude                 – coordinates in WGS‑84
- accuracy                            – coordinate accuracy category (see below)
- date_start, month_start, year_start – start of the process
- date_end, month_end, year_end       – end of the process
- flag_completed                      – whether the process had ended by the report date
- activation_factors                  – activation triggers (systematically filled from 2019)
- impact_description                  – description of the manifestation and consequences
- federal_district, region            – administrative location
- location                            – address description of the EGP occurrence
- coord_source                        – source of coordinates (rosnedra, nominatim, yandexgpt-5-lite, skdf)
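
As an illustration of working with the split date fields, the sketch below derives the reporting quarter from month_start and year_start. It assumes both are stored as integers (or integer-like strings) and that month_start may be empty; Codebook.pdf should be checked for the actual encoding.

```python
def start_quarter(row):
    """Return a label such as '2019-Q3', or None when the month or year is unusable."""
    try:
        year = int(row["year_start"])
        month = int(row["month_start"])
    except (KeyError, TypeError, ValueError):
        return None
    if not 1 <= month <= 12:
        return None
    return f"{year}-Q{(month - 1) // 3 + 1}"

print(start_quarter({"year_start": "2019", "month_start": "8"}))
print(start_quarter({"year_start": "2007", "month_start": ""}))
```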

IMPORTANT LIMITATIONS AND NOTES
1. Coordinate accuracy:
   - «точка ЭГП» = “EGP point” (high precision) – 56% of all records; coordinates are taken
     directly from the 2019–2025 reports, which give the exact location of the EGP.

   - «поселение/дорога» = “settlement/road” (medium precision) – 7% of all records;
     the point is tied to a specific settlement or linear object (road).
     Such points can be used in models with high spatial detail (e.g., when
     matching raster data with 30–100 m cell size such as DEM, climate grids,
     satellite indices) without additional aggregation.

   - «район» = “district” (low precision) – 29% of all records; coordinates
     correspond to the centroid of a region or district, or were obtained only at
     the regional level. Such points are not suitable for fine-resolution modelling
     at the national scale, but can be used for modelling and semi‑quantitative
     statistical assessments on a regional grid (e.g., 10×10 km or larger).

2. Coordinate system:
  – Coordinates from reports after 2023 are given in GSK‑2011. The difference
    between GSK‑2011 and WGS‑84 over Russian territory is <0.8 m, which is
    negligible for regional and continental scales. Therefore, no reprojection
    was performed; all coordinates are stored as WGS‑84 decimal degrees.

3. Activation factors:
  – The “activation_factors” field is not systematically filled for 2007–2018.
    For analyses requiring these factors, it is strongly recommended to use
    data from 2019 onwards.

4. Administrative changes:
  – The data have been updated to reflect the federal district division as of 2026.
    Interregional boundary adjustments have been taken into account in the region field.

5. Duplicate events:
  – The same ongoing process may appear in several quarterly reports.
    Each appearance is kept as a separate record; the flag_completed field
    and the end date help to identify the final stage.
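
One way to collapse repeated appearances of the same process into a single event is sketched below. The grouping key (process type, region, location, and start date) is an assumption made for illustration, not a rule stated by the dataset authors, and flag_completed is assumed to be stored as 1/0; check Codebook.pdf before relying on either.

```python
from collections import defaultdict

def event_key(row):
    """Hypothetical identity of a process across quarterly reports."""
    return (row["process_type"], row["region"], row["location"],
            row["year_start"], row["month_start"], row["date_start"])

def collapse_events(rows):
    """Keep one record per key: a completed one if present, otherwise the last seen."""
    groups = defaultdict(list)
    for row in rows:
        groups[event_key(row)].append(row)
    result = []
    for group in groups.values():
        completed = [r for r in group if str(r["flag_completed"]) == "1"]
        result.append(completed[-1] if completed else group[-1])
    return result

# Invented rows: ids 10 and 11 describe the same process in two reports.
base = {"process_type": "Оп", "region": "A", "location": "X",
        "year_start": 2020, "month_start": 3, "date_start": 15}
rows = [
    {**base, "id": 10, "flag_completed": 0},
    {**base, "id": 11, "flag_completed": 1},
    {**base, "id": 12, "process_type": "Ка", "flag_completed": 0},
]
unique_events = collapse_events(rows)
print([r["id"] for r in unique_events])
```

Counting events without a step like this would overweight long-running processes, since each one contributes a record for every quarter in which it was reported.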

RECOMMENDATIONS FOR USE
- For spatial machine learning (e.g., landslide susceptibility mapping, flood hazard
  modelling): use only records with accuracy “EGP point” or “settlement/road”. These can
  be directly matched to raster predictors (DEM, land cover, precipitation) at 30–100 m
  resolution.

- When using activation factors, limit your analysis to the 2019–2025 period.

- Always consider the incompleteness of the catalogue, especially when comparing
  regions with different monitoring densities.
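
The accuracy guidance can be encoded as a simple filter, sketched below. It assumes the accuracy field holds the Russian labels listed under IMPORTANT LIMITATIONS AND NOTES, and that “EGP point” records are at least as usable as “settlement/road” ones; the threshold values restate the cell sizes mentioned in this README and are not part of the dataset.

```python
import math

# Smallest raster cell size (metres) each accuracy category is assumed to support.
MIN_CELL_M = {
    "точка ЭГП": 30,         # "EGP point": assumed no coarser than settlement/road
    "поселение/дорога": 30,  # "settlement/road": 30–100 m rasters
    "район": 10_000,         # "district": 10×10 km grid or coarser
}

def usable_at(row, cell_size_m):
    """True if the record's accuracy category supports the given cell size.
    Unknown or empty categories (e.g. records without coordinates) never qualify."""
    return MIN_CELL_M.get(row["accuracy"], math.inf) <= cell_size_m

# Invented sample rows.
rows = [
    {"id": 1, "accuracy": "точка ЭГП"},
    {"id": 2, "accuracy": "поселение/дорога"},
    {"id": 3, "accuracy": "район"},
    {"id": 4, "accuracy": ""},
]
fine = [r["id"] for r in rows if usable_at(r, 30)]
coarse = [r["id"] for r in rows if usable_at(r, 10_000)]
print(fine, coarse)
```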

RELATED RESOURCES
- Original Rosnedra quarterly reports: https://geomonitoring.ru/MEGP.php
- Nominatim geocoder (OSM): https://nominatim.openstreetmap.org/
- Map – SKDF GIS: https://скдф.рф/map

LICENSE AND CITATION
This dataset is distributed under the Creative Commons Attribution 4.0
International (CC BY 4.0) license. You are free to share and adapt the data
for any purpose, provided you give appropriate credit.

Recommended citation:
Kachur, Daria; Derkacheva, Anna (2026). Archive of Exogenous Geological Process
Activations in the Russian Federation (2007–2025) (Version 1.0.0) [Dataset].
Zenodo. https://doi.org/10.5281/zenodo.19787928

================================================================================