TabeaSonnenschein/GenSynthPop: R-package for Generating Representative Spatially Explicit Synthetic Populations

Sonnenschein, Tabea; de Mooij, Jan; Pellegrino, Marco; Dastani, Mehdi; Ettema, Dick; Logan, Brian; Verstegen, Judith A.

doi:10.5281/zenodo.13885986

Published September 20, 2024 | Version v2.0.0

Software Open


1. Utrecht University

Instructions for R-package: GenSynthPop

This repository contains the implementation of GenSynthPop, a sample-free tool to construct Synthetic Populations from mixed-aggregation contingency tables.

The package provides functions that prepare stratified census datasets for deriving conditional propensities, combine those propensities with spatial marginal distributions to generate a representative population, and validate that the generated agents follow distributions similar to those of the input spatial marginal and stratified datasets. The generated population is representative of the city, or whatever spatial extent is fed into the algorithms, and can be used for simulation purposes such as agent-based models (ABMs). The smaller the spatial units of the marginal distributions, the more spatially resolved the agents will be.
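To illustrate what "conditional propensities" means here, the following minimal sketch derives the share of each sex within each age group from a small stratified table, using base R only. The toy counts and the column names `age_group`, `sex`, and `count` are assumptions for illustration; this is not the package's internal code.

```r
# Toy stratified (cross-tabulated) counts; all values are invented for illustration
strat_df <- data.frame(
  age_group = c("0-15", "0-15", "15-25", "15-25"),
  sex       = c("male", "female", "male", "female"),
  count     = c(60, 40, 30, 70)
)

# Total count within each conditioning class (here: each age group)
group_totals <- ave(strat_df$count, strat_df$age_group, FUN = sum)

# Conditional propensity: share of each sex within its age group
strat_df$propensity <- strat_df$count / group_totals

print(strat_df)
```

Within each age group the propensities sum to one, which is what allows them to be combined with neighborhood-level totals later on.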

Updates

Changes in version 2.0.0 of GenSynthPop compared to version 1.0.0:

* Implements iterative proportional fitting (IPF) to fit multi-variable joint distributions to spatial marginal distributions.

* Implements deterministic assignment instead of sampling from a probability distribution.

* Fuses all steps into a single function for ease of use.
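The difference between deterministic assignment and sampling can be sketched with largest-remainder rounding, one common way to turn shares into integer counts without a random draw. Whether GenSynthPop uses exactly this scheme is not stated here, so the helper `largest_remainder()` and the toy shares below are assumptions for illustration only.

```r
# Hypothetical helper: turn shares into integer counts that sum exactly to n
largest_remainder <- function(shares, n) {
  raw <- shares / sum(shares) * n        # fractional target counts
  counts <- floor(raw)                   # start from the rounded-down counts
  leftover <- n - sum(counts)            # agents still to be placed
  if (leftover > 0) {
    # give the remaining agents to the classes with the largest fractional parts
    top <- order(raw - counts, decreasing = TRUE)[seq_len(leftover)]
    counts[top] <- counts[top] + 1
  }
  counts
}

# Toy shares for one group of 7 agents (invented values)
shares <- c(low = 0.5, middle = 0.3, high = 0.2)
counts <- largest_remainder(shares, n = 7)

# Deterministic assignment: repeat each class label by its integer count
education_level <- rep(names(shares), times = counts)
print(table(education_level))
```

Unlike `sample()`, repeated runs give the same class counts, so the fitted totals are reproduced exactly rather than on average.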

The work in this repository is described in:

de Mooij, J., Sonnenschein, T., Pellegrino, M. et al. GenSynthPop: generating a spatially explicit synthetic population of individuals and households from aggregated data. Auton Agent Multi-Agent Syst 38, 48 (2024).

https://doi.org/10.1007/s10458-024-09680-7

A Python implementation of this library is also available.

Main Function

Conditional_attribute_adder(): Adds a target attribute to a synthetic population by fitting it to a contingency table, optionally using iterative proportional fitting (IPF) with margins.
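For readers unfamiliar with IPF, the following is a minimal two-dimensional sketch of the general technique in base R. It is not the package's implementation: the function `ipf_2d()`, its arguments, and the toy numbers are assumptions for illustration. The seed plays the role of a stratified table, and the targets play the role of the margins of a single neighborhood.

```r
# Minimal two-dimensional IPF: rescale a seed table until its row and column
# sums match the target margins (function name and arguments are hypothetical)
ipf_2d <- function(seed, row_targets, col_targets, tol = 1e-8, max_iter = 100) {
  fitted <- seed
  for (iter in seq_len(max_iter)) {
    # scale each row so that the row sums match row_targets
    fitted <- fitted * (row_targets / rowSums(fitted))
    # scale each column so that the column sums match col_targets
    fitted <- sweep(fitted, 2, col_targets / colSums(fitted), `*`)
    # column sums now match exactly; stop once the row sums match as well
    if (max(abs(rowSums(fitted) - row_targets)) < tol) break
  }
  fitted
}

# Toy seed: age group by sex counts (invented values)
seed <- matrix(c(60, 40, 30, 70), nrow = 2, byrow = TRUE,
               dimnames = list(age_group = c("0-15", "15-25"),
                               sex = c("male", "female")))

# Toy margins for one neighborhood; both must add up to the same total
age_margin <- c(20, 30)
sex_margin <- c(24, 26)

fitted <- ipf_2d(seed, age_margin, sex_margin)
print(round(fitted, 2))
```

The fitted table keeps the association structure of the seed while reproducing both margins, which is the property that makes IPF suitable for combining stratified data with neighborhood totals.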

Installing package in R

    install.packages("devtools")
    library(devtools)
    install_github("TabeaSonnenschein/GenSynthPop")
    library(GenSynthPop)

Looking up documentation for a function

The functions are documented extensively within R.

Example:

    ?Conditional_attribute_adder
    help(Conditional_attribute_adder)

Should there be remaining questions, shoot me an email: t.s.sonnenschein@uu.nl

Instructions

1. Start by collecting neighborhood marginal distributions of age groups. Use the most spatially resolved data available (the smallest spatial unit), although the right resolution depends on what the synthetic agent population will be used for. In principle you can even use provincial or national administrative areas if that matches your project scope and goal. We use neighborhoods because we want to create an urban ABM.

2. Generate a population by creating a unique agent for each person living in each neighborhood.


    # Load the library
    library(GenSynthPop)

    # check.names = FALSE keeps column names such as "0-15" unchanged
    neigh_df = read.csv("Neighborhood_statistics.csv", check.names = FALSE)

    # Initialize the agent_df: one row per resident, tagged with its neighborhood
    agent_neighborhoods = rep(neigh_df$neighb_code, times = neigh_df$nr_residents)
    agent_count = length(agent_neighborhoods)

    # Agent IDs start at Agent_0; seq_len() also handles an empty population
    agent_ids = paste0("Agent_", seq_len(agent_count) - 1)
    agent_df = data.frame(agent_id = agent_ids,
                          neighb_code = agent_neighborhoods)
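A quick sanity check after step 2 is to confirm that the number of agents per neighborhood equals the input totals. The sketch below does this on a toy table; the neighborhood codes and resident counts are invented, and only the column names `neighb_code` and `nr_residents` are taken from the example above.

```r
# Toy neighborhood table (codes and counts are invented for illustration)
neigh_df <- data.frame(neighb_code = c("N1", "N2", "N3"),
                       nr_residents = c(3, 0, 2))

# One row per resident, built the same way as in step 2
agent_df <- data.frame(
  agent_id = paste0("Agent_", seq_len(sum(neigh_df$nr_residents)) - 1),
  neighb_code = rep(neigh_df$neighb_code, times = neigh_df$nr_residents)
)

# Count agents per neighborhood, keeping neighborhoods without residents
agents_per_neigh <- table(factor(agent_df$neighb_code,
                                 levels = neigh_df$neighb_code))

# Stop with an error if any neighborhood deviates from its input total
stopifnot(all(as.vector(agents_per_neigh) == neigh_df$nr_residents))
print(agents_per_neigh)
```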

3. Use this new agent_df and the neighborhood marginal distribution dataframe to distribute the agents in each neighborhood across age groups.

    library(tidyr)  # provides pivot_longer(), all_of() and the %>% pipe

    # Age-group columns in neigh_df. read.csv() only keeps names such as "0-15"
    # unchanged when it is called with check.names = FALSE.
    agecols = c("0-15", "15-25", "25-45", "45-65", "65+")
    ageneigh_df = neigh_df[c("neighb_code", agecols)] %>%
      pivot_longer(cols = all_of(agecols),
                   names_to = "age_group",
                   values_to = "count")    # one row per neighborhood and age group

    ageneigh_df = as.data.frame(ageneigh_df)

    agent_df = Conditional_attribute_adder(df = agent_df,
                                df_contingency = ageneigh_df,
                                target_attribute = "age_group",
                                group_by = c("neighb_code"))

    print(head(agent_df))
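The long format that is passed as `df_contingency` in step 3 has one row per neighborhood and age group. The base-R sketch below builds the same shape from a toy wide table without tidyr, purely to show the expected structure; the neighborhood codes and counts are invented, and the row order differs from what `pivot_longer()` returns.

```r
# Toy wide table: one row per neighborhood, one column per age group
neigh_df <- data.frame(neighb_code = c("N1", "N2"),
                       `0-15` = c(10, 5),
                       `15-25` = c(20, 15),
                       check.names = FALSE)  # keep "0-15" as a column name
agecols <- c("0-15", "15-25")

# Long table: columns are stacked one after another
ageneigh_df <- data.frame(
  neighb_code = rep(neigh_df$neighb_code, times = length(agecols)),
  age_group   = rep(agecols, each = nrow(neigh_df)),
  count       = unlist(neigh_df[agecols], use.names = FALSE)
)

print(ageneigh_df)

# The counts per neighborhood should add up to the neighborhood population
print(tapply(ageneigh_df$count, ageneigh_df$neighb_code, sum))
```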

4. Read the stratified dataframe that contains the conditional variable and the variable of interest (the one you want to add), for example sex by age group, since age_group has already been added to agent_df. Make sure that the classes of the conditional variables correspond to the ones in agent_df. Any additional neighborhood margins that are available can now be used as well.

    sex_age_df = read.csv("sex_age_statistics.csv") # columns age_group, sex, counts

    # Neighborhood margins for sex. The column names in sexcols are assumed here;
    # replace them with the sex columns of your own neighborhood data.
    sexcols = c("male", "female")
    sexneigh_df <- neigh_df[c("neighb_code", sexcols)] %>%
      pivot_longer(cols = all_of(sexcols),
                   names_to = "sex",
                   values_to = "count")
    sexneigh_df <- as.data.frame(sexneigh_df)

    agent_df = Conditional_attribute_adder(df = agent_df,
                                df_contingency = sex_age_df,
                                target_attribute = "sex",
                                group_by = c("neighb_code"),
                                margins = list(ageneigh_df, sexneigh_df),
                                margins_names = c("age_group", "sex"))
    print(head(agent_df))
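Step 4 asks that the classes of the conditional variables match between the stratified table and agent_df. A small check like the following can catch mismatches such as differently spelled age groups before the attribute is added. The helper `check_classes()` is hypothetical and not part of GenSynthPop, and the toy data are invented.

```r
# Hypothetical helper: list classes present in one table but missing in the other
check_classes <- function(agent_df, contingency_df, variable) {
  agent_classes <- unique(as.character(agent_df[[variable]]))
  table_classes <- unique(as.character(contingency_df[[variable]]))
  list(missing_in_table  = setdiff(agent_classes, table_classes),
       missing_in_agents = setdiff(table_classes, agent_classes))
}

# Toy data with a deliberate mismatch: "65+" versus "65 +"
agent_df <- data.frame(age_group = c("0-15", "15-25", "65+"))
sex_age_df <- data.frame(age_group = c("0-15", "15-25", "65 +"),
                         sex = "female",
                         counts = c(10, 20, 30))

mismatch <- check_classes(agent_df, sex_age_df, "age_group")
print(mismatch)
```

Both list entries being empty means the classes line up; any entry points at a label that needs recoding.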

5. Now we can add multi-variable contingency tables and repeat the function for any data and variables we would like to add. For example, let us add education level based on age and sex. We can use the neighborhood margins for age_group and sex, and also for education_level if they are available. The function accepts contingency tables with any number of variables and any number of neighborhood marginal datasets. The only requirement is that the conditional variables of the contingency table and the marginal data, that is, all variables apart from the target attribute, are already present in agent_df. The algorithm can handle cases where no neighborhood marginal data is available for some conditional variables or target attributes.

    edu_age_sex_df = read.csv("edu_sex_age_statistics.csv") # columns age_group, sex, education_level, counts

    agent_df = Conditional_attribute_adder(df = agent_df,
                                df_contingency = edu_age_sex_df,
                                target_attribute = "education_level",
                                group_by = c("neighb_code"),
                                margins = list(ageneigh_df, sexneigh_df),
                                margins_names = c("age_group", "sex"))
    print(head(agent_df))
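After adding attributes, it can be useful to compare the generated population with the stratified input. The sketch below cross-tabulates a toy agent table and compares observed shares with the shares implied by a toy contingency table, using base R only. All values are invented, and this is a generic check rather than the package's own validation routine.

```r
# Toy agents as they might look after attribute assignment (invented values)
agent_df <- data.frame(
  age_group = c("0-15", "0-15", "0-15", "15-25", "15-25"),
  sex       = c("male", "female", "male", "female", "female")
)

# Toy stratified table with the same columns as in step 4 (invented counts)
sex_age_df <- data.frame(
  age_group = c("0-15", "0-15", "15-25", "15-25"),
  sex       = c("male", "female", "male", "female"),
  counts    = c(60, 40, 30, 70)
)

# Observed number of agents per class combination, including empty ones
observed <- as.data.frame(table(age_group = agent_df$age_group,
                                sex = agent_df$sex),
                          responseName = "n_agents",
                          stringsAsFactors = FALSE)

# Join expected and observed counts and compare their shares
comparison <- merge(sex_age_df, observed, by = c("age_group", "sex"), all.x = TRUE)
comparison$n_agents[is.na(comparison$n_agents)] <- 0
comparison$expected_share <- comparison$counts / sum(comparison$counts)
comparison$observed_share <- comparison$n_agents / sum(comparison$n_agents)
comparison$abs_diff <- abs(comparison$expected_share - comparison$observed_share)

print(comparison)
```

Large values in `abs_diff` flag class combinations where the synthetic population deviates from the stratified data; some deviation is expected when neighborhood margins pull the fit away from the overall shares.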

You can look at the Example_Application_GenSynthPop.R script for an example application of the functions in the package.

License

This package is licensed under the MIT License.

Files

Name	Size
GenSynthPop_2.0.0.tar.gz (md5:9975a0eba111bd2facf66e109ab86880)	18.8 kB

Additional details

DOI: 10.1007/s10458-024-09680-7

Is supplement to: https://github.com/TabeaSonnenschein/GenSynthPop/tree/v2.0.0 (URL)

Repository URL: https://github.com/TabeaSonnenschein/GenSynthPop/tree/main
Programming language: R
Development Status: Active

