Published September 20, 2024 | Version v2.0.0
Software Open

TabeaSonnenschein/GenSynthPop: R-package for Generating Representative Spatially Explicit Synthetic Populations

Description

Instructions for R-package: GenSynthPop
 
This repository contains the implementation of GenSynthPop, a sample-free tool to construct Synthetic Populations from mixed-aggregation contingency tables.
 
This package provides functions that (1) prepare stratified census datasets to derive conditional propensities, (2) combine these conditional propensities with spatial marginal distributions to generate a representative population, and (3) validate that the generated agents are distributed similarly to both the initial spatial marginal datasets and the stratified datasets. The generated population is representative of the city, or of whichever spatial extent is fed into the algorithms, and can be used for simulation purposes such as an agent-based model (ABM). The smaller the spatial units of the spatial marginal distributions, the more spatially resolved the agents will be.
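To make the notion of a conditional propensity concrete, here is a minimal base-R sketch. It is not the package's internal code, and the table `sex_age` with its counts is made up for illustration.

```r
# Hypothetical stratified counts: sex by age group (made-up numbers)
sex_age <- data.frame(
  age_group = c("0-15", "0-15", "15-25", "15-25"),
  sex       = c("male", "female", "male", "female"),
  count     = c(51, 49, 40, 60)
)

# Conditional propensity P(sex | age_group): each count divided by
# the total count of its age group
group_total <- ave(sex_age$count, sex_age$age_group, FUN = sum)
sex_age$propensity <- sex_age$count / group_total

print(sex_age)
```

Within each age group the propensities sum to one, so they can be applied to any number of agents in that group.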
 
Updates
Changes in Version 2.0.0 of the package GenSynthPop compared to Version 1.0.0
* implements iterative proportional fitting (IPF) to fit multi-variable joint distributions to spatial marginal distributions
* implements deterministic assignment instead of sampling from a probability distribution
* fuses all steps into a single function, for ease of use
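The difference between sampling and deterministic assignment can be illustrated with a small base-R sketch. It uses largest-remainder rounding as one possible deterministic rule, which is an assumption for illustration and not necessarily the rule the package implements. The helper name `allocate_counts` is hypothetical.

```r
# Hypothetical helper: turn propensities into integer counts that sum to n,
# using largest-remainder rounding (an illustrative choice)
allocate_counts <- function(propensities, n) {
  expected <- propensities / sum(propensities) * n
  counts   <- floor(expected)
  leftover <- n - sum(counts)
  if (leftover > 0) {
    # Give the remaining agents to the classes with the largest fractional parts
    top <- order(expected - counts, decreasing = TRUE)[seq_len(leftover)]
    counts[top] <- counts[top] + 1
  }
  counts
}

propensities <- c(male = 0.49, female = 0.51)
counts <- allocate_counts(propensities, n = 11)

# Deterministic assignment: repeat each class label by its count
assigned <- rep(names(propensities), times = counts)
print(table(assigned))

# Sampling alternative: the resulting counts vary from run to run
sampled <- sample(names(propensities), size = 11, replace = TRUE,
                  prob = propensities)
print(table(sampled))
```

With a deterministic rule, the same inputs always produce the same class counts, which makes the generated population reproducible without setting a random seed.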
 
The work in this repository is described in:
 de Mooij, J., Sonnenschein, T., Pellegrino, M. et al. GenSynthPop: generating a spatially explicit synthetic population of individuals and households from aggregated data. Auton Agent Multi-Agent Syst 38, 48 (2024).
 
A Python implementation of this library is available here
 
Main Function
Conditional_attribute_adder(): Adds a target attribute to a synthetic population by fitting it to a contingency table, optionally using iterative proportional fitting (IPF) with margins.
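For readers unfamiliar with IPF, the following is a minimal two-dimensional sketch in base R. It only illustrates the general technique and is not the code used inside Conditional_attribute_adder(). The function name `ipf_2d` and all numbers are assumptions made for this example.

```r
# Minimal two-dimensional IPF sketch (illustrative, base R only)
ipf_2d <- function(seed, row_target, col_target, max_iter = 100, tol = 1e-8) {
  fitted <- seed
  for (iter in seq_len(max_iter)) {
    # Scale rows so that the row sums match the row targets
    fitted <- fitted * (row_target / rowSums(fitted))
    # Scale columns so that the column sums match the column targets
    fitted <- sweep(fitted, 2, col_target / colSums(fitted), `*`)
    # Stop once the row sums are also (nearly) matched
    if (max(abs(rowSums(fitted) - row_target)) < tol) break
  }
  fitted
}

# Hypothetical city-wide age-by-sex counts used as the seed
seed <- matrix(c(50, 40, 50, 60), nrow = 2,
               dimnames = list(age_group = c("0-15", "15-25"),
                               sex = c("male", "female")))

# Hypothetical marginal totals for one neighborhood (both sum to 30)
age_margin <- c("0-15" = 10, "15-25" = 20)
sex_margin <- c(male = 12, female = 18)

fitted <- ipf_2d(seed, age_margin, sex_margin)
print(round(fitted, 2))
```

The fitted table keeps the association structure of the seed while its row and column totals match the neighborhood margins. The sketch assumes a strictly positive seed and margins with equal totals.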
 
Installing package in R
 
 
    install.packages("devtools")
    library(devtools)
    install_github("TabeaSonnenschein/GenSynthPop")
    library(GenSynthPop)
 
Looking up documentation for a function
The functions are documented extensively within R.
 
Example:
 
    ?Conditional_attribute_adder
    help(Conditional_attribute_adder)
Should there be remaining questions, shoot me an email: t.s.sonnenschein@uu.nl
 
Instructions
 
1. Start by collecting neighborhood marginal distributions of age groups. We recommend using the smallest spatial unit available, although the appropriate resolution depends on what you want to use the synthetic agent population for. In principle you can even use provincial or national administrative areas, if that matches your project scope and goal. We use neighborhoods because we want to create an urban ABM.
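Published marginals do not always add up to the stated neighborhood total (for example because of rounding), so it can help to check this before generating agents. The sketch below uses a made-up stand-in for the neighborhood statistics; the column names follow the ones used in the later steps and the numbers are invented.

```r
# Hypothetical stand-in for Neighborhood_statistics.csv (made-up numbers)
neigh_df <- data.frame(
  neighb_code  = c("N01", "N02"),
  nr_residents = c(10, 8),
  `0-15` = c(2, 1), `15-25` = c(1, 2), `25-45` = c(3, 2),
  `45-65` = c(2, 2), `65+` = c(2, 1),
  check.names = FALSE  # keep column names such as "0-15" unchanged
)

agecols <- c("0-15", "15-25", "25-45", "45-65", "65+")

# Flag neighborhoods whose age-group counts do not add up to the total
age_sum  <- rowSums(neigh_df[agecols])
mismatch <- neigh_df$neighb_code[age_sum != neigh_df$nr_residents]
print(mismatch)
```

Any neighborhood code printed here needs attention, for instance by rescaling the age counts, before the marginals are used.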
 
2. Generate a population by creating one unique agent for each person living in each neighborhood.

    # Load the library
    library(GenSynthPop)

    # check.names = FALSE keeps column names such as "0-15" unchanged
    neigh_df = read.csv("Neighborhood_statistics.csv", check.names = FALSE)

    # Initialize the agent_df: one row per resident of each neighborhood
    agent_neighborhoods = rep(neigh_df$neighb_code, times = neigh_df$nr_residents)
    agent_ids = paste0("Agent_", seq_along(agent_neighborhoods) - 1)
    agent_df = data.frame(agent_id = agent_ids,
                          neighb_code = agent_neighborhoods)
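A quick sanity check on the initial agent table is to confirm that every neighborhood received exactly as many agents as it has residents and that the agent IDs are unique. The sketch below uses toy inputs with made-up numbers that mirror the structure used above.

```r
# Toy inputs (made-up numbers) mirroring the structure used above
neigh_df <- data.frame(neighb_code = c("N01", "N02"), nr_residents = c(3, 2))
agent_df <- data.frame(
  agent_id    = paste0("Agent_", 0:4),
  neighb_code = rep(neigh_df$neighb_code, times = neigh_df$nr_residents)
)

# Count agents per neighborhood, ordered like neigh_df
agents_per_neigh <- table(agent_df$neighb_code)[neigh_df$neighb_code]

# Both checks stop with an error if the agent table is inconsistent
stopifnot(all(agents_per_neigh == neigh_df$nr_residents))
stopifnot(!anyDuplicated(agent_df$agent_id))
```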
 
3. Use this new agent_df together with the neighborhood marginal distribution dataframe to assign each agent an age group within its neighborhood.
 
    library(dplyr)   # provides %>%
    library(tidyr)   # provides pivot_longer()

    # Assumes the age columns of neigh_df carry exactly these names
    agecols = c("0-15", "15-25", "25-45", "45-65", "65+")
    ageneigh_df = neigh_df[c("neighb_code", agecols)] %>%
      pivot_longer(cols = all_of(agecols),
                   names_to = "age_group",
                   values_to = "count")    # Create a new column for counts

    ageneigh_df = as.data.frame(ageneigh_df)

    agent_df = Conditional_attribute_adder(df = agent_df,
                                df_contingency = ageneigh_df,
                                target_attribute = "age_group",
                                group_by = c("neighb_code"))

    print(head(agent_df))
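Once an attribute has been added, the generated counts can be compared against the marginals they were fitted to. The following base-R sketch shows one way to do that. The toy tables stand in for ageneigh_df and agent_df, and their values are made up.

```r
# Toy long-format marginals and a toy agent table (made-up numbers)
ageneigh_df <- data.frame(
  neighb_code = c("N01", "N01", "N02", "N02"),
  age_group   = c("0-15", "15-25", "0-15", "15-25"),
  count       = c(2, 1, 1, 1)
)
agent_df <- data.frame(
  neighb_code = c("N01", "N01", "N01", "N02", "N02"),
  age_group   = c("0-15", "0-15", "15-25", "0-15", "15-25")
)

# Count generated agents per neighborhood and age group
generated <- as.data.frame(table(neighb_code = agent_df$neighb_code,
                                 age_group   = agent_df$age_group),
                           responseName = "generated",
                           stringsAsFactors = FALSE)

# Join with the marginals and compute the difference per cell
comparison <- merge(ageneigh_df, generated, by = c("neighb_code", "age_group"))
comparison$diff <- comparison$generated - comparison$count
print(comparison)
```

A `diff` of zero in every row means the agents reproduce the marginal distribution exactly.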
 
4. Read the stratified dataframe that contains the conditional variable and the variable of interest (the one you want to add), for example sex by age group, since age group has already been added to the agents. Make sure that the classes of the conditional variables correspond to the ones in agent_df. We can now also use any additional neighborhood margins that we have.
 
    sex_age_df = read.csv("sex_age_statistics.csv") # columns age_group, sex, counts

    # Neighborhood margins for sex. The names in sexcols are an assumption;
    # replace them with the sex columns of your own neigh_df.
    sexcols = c("male", "female")
    sexneigh_df <- neigh_df[c("neighb_code", sexcols)] %>%
      pivot_longer(cols = all_of(sexcols),
                   names_to = "sex",
                   values_to = "count")
    sexneigh_df <- as.data.frame(sexneigh_df)

    agent_df = Conditional_attribute_adder(df = agent_df,
                                df_contingency = sex_age_df,
                                target_attribute = "sex",
                                group_by = c("neighb_code"),
                                margins = list(ageneigh_df, sexneigh_df),
                                margins_names = c("age_group", "sex"))
    print(head(agent_df))
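Mismatched class labels between the agents and a contingency table are easy to miss, so a small check before a call like the one above can save time. The helper `missing_labels` below is hypothetical (it is not part of GenSynthPop) and the toy data is made up.

```r
# Hypothetical helper: report class labels of `column` that occur in the
# agent table but are missing from the contingency table
missing_labels <- function(agent_df, df_contingency, column) {
  setdiff(unique(agent_df[[column]]), unique(df_contingency[[column]]))
}

# Toy data (made up); note the label mismatch "65+" versus "65 plus"
agent_df   <- data.frame(age_group = c("0-15", "15-25", "65+"))
sex_age_df <- data.frame(age_group = c("0-15", "15-25", "65 plus"),
                         sex = "male", counts = c(5, 4, 3))

print(missing_labels(agent_df, sex_age_df, "age_group"))
```

An empty result means every class present among the agents also appears in the contingency table.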
 
5. Now we can add multi-variable contingency tables and repeat the function for any data and variables we would like to add. For example, let us add education level based on age and sex. We can use the neighborhood margins for age_group and sex, and also for education_level if we have them. The function accepts contingency tables with any number of variables and any number of neighborhood marginal datasets. The only requirement is that the conditional variables of the contingency table and marginal data (that is, all variables apart from the target attribute) are present in agent_df. The algorithm can handle cases in which no neighborhood marginal data is available for some conditional variables or target attributes.
 
    edu_age_sex_df = read.csv("edu_sex_age_statistics.csv") # columns age_group, sex, education_level, counts

    agent_df = Conditional_attribute_adder(df = agent_df,
                                df_contingency = edu_age_sex_df,
                                target_attribute = "education_level",
                                group_by = c("neighb_code"),
                                margins = list(ageneigh_df, sexneigh_df),
                                margins_names = c("age_group", "sex"))
    print(head(agent_df))
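To validate a conditionally added attribute, the share of each class within its stratum can be compared between the source contingency table and the generated agents. The sketch below does this in base R with toy data; the numbers are made up and the column names follow the comment in the code above.

```r
# Toy contingency table and toy agents (made-up numbers)
edu_age_sex_df <- data.frame(
  age_group = "25-45", sex = "female",
  education_level = c("low", "high"),
  counts = c(40, 60)
)
agent_df <- data.frame(
  age_group = "25-45", sex = "female",
  education_level = rep(c("low", "high"), times = c(2, 3))
)

# Expected share of each education level within its age-sex stratum
edu_age_sex_df$expected <- edu_age_sex_df$counts /
  ave(edu_age_sex_df$counts, edu_age_sex_df$age_group, edu_age_sex_df$sex,
      FUN = sum)

# Generated share: count agents per cell, then divide by the stratum total
agent_counts <- aggregate(n ~ age_group + sex + education_level,
                          data = transform(agent_df, n = 1), FUN = sum)
agent_counts$generated <- agent_counts$n /
  ave(agent_counts$n, agent_counts$age_group, agent_counts$sex, FUN = sum)

comparison <- merge(edu_age_sex_df, agent_counts,
                    by = c("age_group", "sex", "education_level"))
print(comparison[c("age_group", "sex", "education_level",
                   "expected", "generated")])
```

When neighborhood margins are used in the fit, the generated shares need not match the city-wide table exactly, so small deviations are to be expected.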

 

You can look at the Example_Application_GenSynthPop.R script for an example application of the functions in the package.
 
License
This package is licensed under the MIT License.


Additional details

Software

Repository URL: https://github.com/TabeaSonnenschein/GenSynthPop/tree/main
Programming language: R
Development status: Active