geosnap.data.data module¶
Tools for creating and manipulating neighborhood datasets.
-
class
geosnap.data.data.
Community
(gdf=None, harmonized=None, **kwargs)[source]¶ Bases:
object
Spatial and tabular data for a collection of “neighborhoods”.
A community is a collection of “neighborhoods” represented by spatial boundaries (e.g. census tracts, or blocks in the US), and tabular data which describe the composition of each neighborhood (e.g. data from surveys, sensors, or geocoded misc.). A Community can be large (e.g. a metropolitan region), or small (e.g. a handful of census tracts) and may have data pertaining to multiple discrete points in time.
- Parameters
- gdfgeopandas.GeoDataFrame
long-form geodataframe that holds spatial and tabular data.
- harmonizedbool
Whether neighborhood boundaries have been harmonized into a set of time-consistent units
- **kwargs
- Attributes
- gdfgeopandas.GeoDataFrame
long-form geodataframe that stores neighborhood-level attributes and geometries for one or more time periods
- harmonizedbool
Whether neighborhood boundaries have been harmonized into consistent units over time
Methods
cluster
(self[, n_clusters, method, …])Create a geodemographic typology by running a cluster analysis on the study area’s neighborhood attributes
cluster_spatial
(self[, n_clusters, …])Create a spatial geodemographic typology by running a cluster analysis on the metro area’s neighborhood attributes and including a contiguity constraint.
from_census
([state_fips, county_fips, …])Create a new Community from original vintage US Census data.
from_geodataframes
([gdfs])Create a new Community from a list of geodataframes.
from_lodes
([state_fips, county_fips, …])Create a new Community from Census LEHD/LODES data.
from_ltdb
([state_fips, county_fips, …])Create a new Community from LTDB data.
from_ncdb
([state_fips, county_fips, …])Create a new Community from NCDB data.
harmonize
(self[, target_year, …])Harmonize neighborhood boundaries into time-consistent units using spatial interpolation.
sequence
(self, cluster_col[, seq_clusters, …])Pairwise sequence analysis to evaluate the distance/dissimilarity between every two neighborhood sequences.
transition
(self, cluster_col[, time_var, …])(Spatial) Markov approach to transitional dynamics of neighborhoods.
-
cluster
(self, n_clusters=6, method=None, best_model=False, columns=None, verbose=False, return_model=False, scaler=None, **kwargs)[source]¶ Create a geodemographic typology by running a cluster analysis on the study area’s neighborhood attributes
- Parameters
- gdfpandas.DataFrame
long-form (geo)DataFrame containing neighborhood attributes
- n_clustersint
the number of clusters to model (the default is 6).
- methodstr
the clustering algorithm used to identify neighborhood types
- best_modelbool
if using a Gaussian mixture model, use BIC to choose the best n_clusters (the default is False).
- columnslist-like
subset of columns on which to apply the clustering
- verbosebool
whether to print warning messages (the default is False).
- return_modelbool
whether to return the underlying cluster model instance for further analysis
- scaler: str or sklearn.preprocessing.Scaler
a scikit-learn preprocessing class that will be used to rescale the data. Defaults to StandardScaler
- Returns
- pandas.DataFrame with a column of neighborhood cluster labels appended
- as a new column. Will overwrite columns of the same name.
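The `scaler` argument defaults to scikit-learn's StandardScaler, which rescales each column to zero mean and unit variance before clustering. The pure-Python sketch below shows that transformation on a toy attribute table. The column names and values are made up for illustration, and this is not geosnap's internal code.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = fmean(column)
    std = pstdev(column)  # population std (ddof=0), as StandardScaler uses
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


# Hypothetical neighborhood attributes measured on very different scales.
attributes = {
    "median_income": [30000.0, 50000.0, 70000.0, 90000.0],
    "pct_renters": [0.8, 0.6, 0.4, 0.2],
}

# After scaling, both columns contribute comparably to distance calculations.
scaled = {name: standardize(values) for name, values in attributes.items()}
```

Without rescaling, a distance-based clustering method would be dominated by the income column simply because its raw values are larger.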
-
cluster_spatial
(self, n_clusters=6, spatial_weights='rook', method=None, best_model=False, columns=None, threshold_variable='count', threshold=10, return_model=False, scaler=None, **kwargs)[source]¶ Create a spatial geodemographic typology by running a cluster analysis on the metro area’s neighborhood attributes and including a contiguity constraint.
- Parameters
- gdfgeopandas.GeoDataFrame
long-form geodataframe holding neighborhood attribute and geometry data.
- n_clustersint
the number of clusters to model (the default is 6).
- spatial_weightsstr ‘queen’ or ‘rook’
spatial weights matrix specification (the default is “rook”).
- methodstr
the clustering algorithm used to identify neighborhood types
- best_modelbool
if using a Gaussian mixture model, use BIC to choose the best n_clusters (the default is False).
- columnslist-like
subset of columns on which to apply the clustering
- threshold_variablestr
for max-p, which variable should define p. The default is “count”, which will grow regions until the threshold number of polygons have been aggregated
- thresholdnumeric
threshold to use for max-p clustering (the default is 10).
- return_modelbool
whether to return the underlying cluster model instance for further analysis
- scaler: str or sklearn.preprocessing.Scaler
a scikit-learn preprocessing class that will be used to rescale the data. Defaults to StandardScaler
- Returns
- geopandas.GeoDataFrame with a column of neighborhood cluster labels
- appended as a new column. Will overwrite columns of the same name.
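The contiguity constraint depends on whether the spatial weights use the ‘rook’ or ‘queen’ criterion. As a rough illustration, the sketch below builds both neighbor sets for a regular grid of cells: rook neighbors share an edge, while queen neighbors share an edge or a corner. The `grid_neighbors` helper is hypothetical and stands in for a real spatial weights object built from polygon geometries.

```python
def grid_neighbors(rows, cols, criterion="rook"):
    """Return {cell: set of neighboring cells} for a regular grid of polygons."""
    if criterion not in ("rook", "queen"):
        raise ValueError("criterion must be 'rook' or 'queen'")
    # Rook neighbors share an edge; queen neighbors share an edge or a corner.
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if criterion == "queen":
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    neighbors = {}
    for r in range(rows):
        for c in range(cols):
            neighbors[(r, c)] = {
                (r + dr, c + dc)
                for dr, dc in offsets
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            }
    return neighbors


rook = grid_neighbors(3, 3, "rook")
queen = grid_neighbors(3, 3, "queen")
```

On this 3x3 grid the center cell has four rook neighbors and eight queen neighbors, so a queen specification gives a contiguity-constrained algorithm more ways to join adjacent units.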
-
classmethod
from_census
(state_fips=None, county_fips=None, msa_fips=None, fips=None, boundary=None, years=[1990, 2000, 2010])[source]¶ Create a new Community from original vintage US Census data.
Instantiate a new Community from original vintage US Census data. To use, you must first download and register census data with geosnap using the store_census function. Pass lists of states, counties, or any arbitrary FIPS codes to create a community. All fips code arguments are additive, so geosnap will include the largest unique set. Alternatively, you may provide a boundary to use as a clipping feature.
- Parameters
- state_fipslist or str
string or list of strings of two-digit fips codes defining states to include in the study area.
- county_fipslist or str
string or list of strings of five-digit fips codes defining counties to include in the study area.
- msa_fipslist or str
string or list of strings of fips codes defining MSAs to include in the study area.
- fipslist or str
string or list of strings of five-digit fips codes defining counties to include in the study area.
- boundary: geopandas.GeoDataFrame
geodataframe that defines the total extent of the study area. This will be used to clip tracts lazily by selecting all `GeoDataFrame.representative_point()`s that intersect the boundary gdf
- yearslist
list of years to include in the study data (the default is [1990, 2000, 2010]).
- Returns
- Community
Community with unharmonized census data
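The FIPS arguments are described as additive: a unit is included if it matches any of the supplied codes, and overlapping codes do not produce duplicates. A minimal sketch of that selection logic follows, assuming units are matched by the leading digits of their geoid. The `select_geoids` helper and the tract identifiers are hypothetical, and MSA codes (which require a county crosswalk) are left out.

```python
def select_geoids(geoids, state_fips=None, county_fips=None, fips=None):
    """Keep geoids whose prefix matches any of the supplied FIPS codes."""

    def as_list(value):
        # Accept None, a single string, or a list of strings.
        if value is None:
            return []
        return [value] if isinstance(value, str) else list(value)

    prefixes = tuple(as_list(state_fips) + as_list(county_fips) + as_list(fips))
    if not prefixes:
        return []
    # str.startswith accepts a tuple, so a geoid is kept if ANY prefix matches,
    # and each geoid is visited only once, so overlaps cannot duplicate rows.
    return [g for g in geoids if g.startswith(prefixes)]


# Hypothetical 11-digit tract identifiers (state + county + tract).
tracts = ["06037101110", "06059001100", "11001000100", "24031700100"]

# One state plus one county from a different state.
selected = select_geoids(tracts, state_fips="06", county_fips=["11001"])

# A county nested inside an already selected state adds nothing new.
overlapping = select_geoids(tracts, state_fips="06", county_fips="06037")
```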
-
classmethod
from_geodataframes
(gdfs=None)[source]¶ Create a new Community from a list of geodataframes.
- Parameters
- gdfslist-like
list of geodataframes that hold attribute and geometry data for a study area. Each geodataframe must have neighborhood attribute data, geometry data, and a time column that defines how the geodataframes are sequenced. The geometries may be stable over time (in which case the dataset is harmonized) or may be unique for each time. If the data are harmonized, the dataframes must also have an ID variable that indexes neighborhood units over time.
-
classmethod
from_lodes
(state_fips=None, county_fips=None, msa_fips=None, fips=None, boundary=None, years=2015, dataset='wac')[source]¶ Create a new Community from Census LEHD/LODES data.
Instantiate a new Community from LODES data. Pass lists of states, counties, or any arbitrary FIPS codes to create a community. All fips code arguments are additive, so geosnap will include the largest unique set. Alternatively, you may provide a boundary to use as a clipping feature.
- Parameters
- state_fipslist or str
string or list of strings of two-digit fips codes defining states to include in the study area.
- county_fipslist or str
string or list of strings of five-digit fips codes defining counties to include in the study area.
- msa_fipslist or str
string or list of strings of fips codes defining MSAs to include in the study area.
- fipslist or str
string or list of strings of five-digit fips codes defining counties to include in the study area.
- boundary: geopandas.GeoDataFrame
geodataframe that defines the total extent of the study area. This will be used to clip tracts lazily by selecting all `GeoDataFrame.representative_point()`s that intersect the boundary gdf
- yearslist
year or list of years to include in the study data (the default is 2015).
- dataset: str
which LODES dataset should be used to create the Community. Options are ‘wac’ for workplace area characteristics or ‘rac’ for residence area characteristics.
- Returns
- Community
Community with LODES data
-
classmethod
from_ltdb
(state_fips=None, county_fips=None, msa_fips=None, fips=None, boundary=None, years=[1970, 1980, 1990, 2000, 2010])[source]¶ Create a new Community from LTDB data.
Instantiate a new Community from pre-harmonized LTDB data. To use, you must first download and register LTDB data with geosnap using the store_ltdb function. Pass lists of states, counties, or any arbitrary FIPS codes to create a community. All fips code arguments are additive, so geosnap will include the largest unique set. Alternatively, you may provide a boundary to use as a clipping feature.
- Parameters
- state_fipslist or str
string or list of strings of two-digit fips codes defining states to include in the study area.
- county_fipslist or str
string or list of strings of five-digit fips codes defining counties to include in the study area.
- msa_fipslist or str
string or list of strings of fips codes defining MSAs to include in the study area.
- fipslist or str
string or list of strings of five-digit fips codes defining counties to include in the study area.
- boundary: geopandas.GeoDataFrame
geodataframe that defines the total extent of the study area. This will be used to clip tracts lazily by selecting all `GeoDataFrame.representative_point()`s that intersect the boundary gdf
- yearslist
list of years (decades) to include in the study data (the default is [1970, 1980, 1990, 2000, 2010]).
- Returns
- Community
Community with LTDB data
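The `boundary` argument selects units whose representative point intersects the boundary, rather than clipping geometries. The sketch below illustrates the idea with a ray-casting point-in-polygon test. It is an approximation under stated assumptions: tracts are axis-aligned rectangles, and the rectangle center stands in for `GeoDataFrame.representative_point()`. All names and coordinates are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as a vertex list?"""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        # Does a horizontal ray cast to the right of the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rectangle_center(xmin, ymin, xmax, ymax):
    """Center of a rectangle, used here as a stand-in representative point."""
    return ((xmin + xmax) / 2, (ymin + ymax) / 2)


# Hypothetical square study-area boundary and rectangular tracts.
boundary = [(0, 0), (4, 0), (4, 4), (0, 4)]
tracts = {
    "a": (1, 1, 2, 2),  # fully inside the boundary
    "b": (3, 3, 6, 6),  # overlaps, but its center (4.5, 4.5) is outside
    "c": (2, 2, 5, 5),  # overlaps, and its center (3.5, 3.5) is inside
}

kept = [
    name
    for name, box in tracts.items()
    if point_in_polygon(rectangle_center(*box), boundary)
]
```

Tract "b" overlaps the boundary but is dropped because its point falls outside, which is the practical consequence of selecting by point rather than by polygon intersection.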
-
classmethod
from_ncdb
(state_fips=None, county_fips=None, msa_fips=None, fips=None, boundary=None, years=[1970, 1980, 1990, 2000, 2010])[source]¶ Create a new Community from NCDB data.
Instantiate a new Community from pre-harmonized NCDB data. To use, you must first download and register NCDB data with geosnap using the store_ncdb function. Pass lists of states, counties, or any arbitrary FIPS codes to create a community. All fips code arguments are additive, so geosnap will include the largest unique set. Alternatively, you may provide a boundary to use as a clipping feature.
- Parameters
- state_fipslist or str
string or list of strings of two-digit fips codes defining states to include in the study area.
- county_fipslist or str
string or list of strings of five-digit fips codes defining counties to include in the study area.
- msa_fipslist or str
string or list of strings of fips codes defining MSAs to include in the study area.
- fipslist or str
string or list of strings of five-digit fips codes defining counties to include in the study area.
- boundary: geopandas.GeoDataFrame
geodataframe that defines the total extent of the study area. This will be used to clip tracts lazily by selecting all `GeoDataFrame.representative_point()`s that intersect the boundary gdf
- yearslist
list of years (decades) to include in the study data (the default is [1970, 1980, 1990, 2000, 2010]).
- Returns
- Community
Community with NCDB data
-
harmonize
(self, target_year=None, weights_method='area', extensive_variables=None, intensive_variables=None, allocate_total=True, raster='nlcd_2011', codes=[21, 22, 23, 24], force_crs_match=True)[source]¶ Harmonize neighborhood boundaries into time-consistent units using spatial interpolation.
- Parameters
- target_year: int
Polygons from this year will become the target boundaries for spatial interpolation.
- weights_methodstring
The method by which the harmonization will be conducted. This can be set to:
- “area”: harmonization according to area weights.
- “land_type_area”: harmonization according to the Land Types considered ‘populated’ areas.
- “land_type_Poisson_regression”: NOT YET INTRODUCED.
- “land_type_Gaussian_regression”: NOT YET INTRODUCED.
- extensive_variableslist
extensive variables to be used in interpolation.
- intensive_variableslist
intensive variables to be used in interpolation.
- allocate_totalboolean
True if total value of source area should be allocated. False if denominator is area of i. Note that the two cases would be identical when the area of the source polygon is exhausted by intersections. See (3) in Notes for more details
- rasterstr
path to the raster image that has the types of each pixel in the spatial context (the default is ‘nlcd_2011’). Only used for raster-based harmonization.
- codeslist
pixel values that should be included in the regression (the default is [21, 22, 23, 24]).
- force_crs_matchbool
whether source and target dataframes should be reprojected to match (the default is True).
- Returns
- None
New data are added to the input Community
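To make the “area” weights method concrete, here is a minimal sketch of areal interpolation on axis-aligned rectangles. Extensive variables (counts such as population) are split according to the share of each source polygon that falls in the target. Intensive variables (rates or averages such as median income) are averaged according to the share of the target covered by each source. The helper names and data are hypothetical, the sketch assumes every target is fully covered by sources, and real harmonization operates on arbitrary polygons.

```python
def area(box):
    """Area of an (xmin, ymin, xmax, ymax) rectangle; zero if degenerate."""
    xmin, ymin, xmax, ymax = box
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)


def intersection_area(a, b):
    """Area of overlap between two rectangles."""
    box = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(box)


def interpolate(sources, targets):
    """sources: list of (box, population, median_income) tuples.

    Returns one (population, median_income) estimate per target box.
    """
    results = []
    for target in targets:
        population = 0.0
        income = 0.0
        for box, src_pop, src_income in sources:
            overlap = intersection_area(box, target)
            # Extensive: allocate the count by share of the SOURCE area.
            population += src_pop * overlap / area(box)
            # Intensive: weight the value by share of the TARGET area.
            income += src_income * overlap / area(target)
        results.append((population, income))
    return results


# Two source tracts from one year, re-expressed on two target-year tracts.
sources = [
    ((0, 0, 2, 2), 100.0, 40000.0),
    ((2, 0, 4, 2), 200.0, 60000.0),
]
targets = [(0, 0, 3, 2), (3, 0, 4, 2)]
estimates = interpolate(sources, targets)
```

Because the targets together cover the sources exactly, the interpolated population totals still sum to the original 300.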
-
sequence
(self, cluster_col, seq_clusters=5, subs_mat=None, dist_type=None, indel=None, time_var='year', id_var='geoid')[source]¶ Pairwise sequence analysis to evaluate the distance/dissimilarity between every two neighborhood sequences.
The sequence approach should be adopted after neighborhood segmentation since the column name of neighborhood labels is a required input.
- Parameters
- cluster_colstring or int
Column name for the neighborhood segmentation, such as “ward”, “kmeans”, etc.
- seq_clustersint, optional
Number of neighborhood sequence clusters. Agglomerative Clustering with Ward linkage is now used for clustering the sequences. Default is 5.
- subs_matarray
(k,k), substitution cost matrix. Should be hollow (0 cost between the same type), symmetric and non-negative.
- dist_typestring
Type of distance used to compare sequences. Options are:
- “hamming”: Hamming distance (substitution only, with a constant cost of 1) from sklearn.metrics.
- “markov”: empirical transition probabilities are used to define substitution costs.
- “interval”: differences between states are used to define substitution costs, and indel=k-1.
- “arbitrary”: arbitrary distance for use when there is no strong theoretical guidance: substitution=0.5, indel=1.
- “tran”: transition-oriented optimal matching on the sequence of transitions. Based on [Bie11].
- indelfloat, optional
insertion/deletion cost.
- time_varstring, optional
Column defining time and/or sequencing of the long-form data. Default is “year”.
- id_varstring, optional
Column identifying the unique id of spatial units. Default is “geoid”.
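As an illustration of the simplest option, “hamming”, the sketch below computes a pairwise distance matrix between neighborhood label sequences by counting the positions at which two sequences differ. The tract names and labels are hypothetical, and using a raw count (rather than a normalized fraction) is an assumption made for readability.

```python
def hamming_distance(seq_a, seq_b):
    """Number of time periods in which two label sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance requires equal-length sequences")
    # Substitution only: every mismatch costs a constant 1.
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


# Hypothetical cluster labels for three tracts over four time periods.
sequences = {
    "tract_1": [0, 0, 1, 1],
    "tract_2": [0, 1, 1, 1],
    "tract_3": [2, 2, 2, 1],
}

ids = list(sequences)
distance_matrix = [
    [hamming_distance(sequences[i], sequences[j]) for j in ids] for i in ids
]
```

The resulting matrix is hollow and symmetric, and a matrix of this kind is what the sequence clustering step then groups into `seq_clusters` types.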
-
transition
(self, cluster_col, time_var='year', id_var='geoid', w_type=None, permutations=0)[source]¶ (Spatial) Markov approach to transitional dynamics of neighborhoods.
The transitional dynamics approach should be adopted after neighborhood segmentation since the column name of neighborhood labels is a required input.
- Parameters
- cluster_colstring or int
Column name for the neighborhood segmentation, such as “ward”, “kmeans”, etc.
- time_varstring, optional
Column defining time and/or sequencing of the long-form data. Default is “year”.
- id_varstring, optional
Column identifying the unique id of spatial units. Default is “geoid”.
- w_typestring, optional
Type of spatial weights (“rook”, “queen”, “knn” or “kernel”) to be used for spatial structure. Default is None, if non-spatial Markov transition rates are desired.
- permutationsint, optional
number of permutations for use in randomization based inference (the default is 0).
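For the non-spatial case (`w_type=None`), the core quantity is a Markov transition probability matrix estimated from the sequence of cluster labels in each neighborhood. A minimal sketch of that estimate follows: count the observed moves between consecutive periods, then normalize each row. The `transition_matrix` helper and the label sequences are hypothetical and do not reproduce geosnap's implementation.

```python
from collections import Counter


def transition_matrix(sequences, n_states):
    """Row-normalized matrix of transition probabilities between states."""
    counts = Counter()
    for seq in sequences:
        # Pair each period's label with the label in the following period.
        for origin, destination in zip(seq, seq[1:]):
            counts[(origin, destination)] += 1
    matrix = []
    for i in range(n_states):
        row_total = sum(counts[(i, j)] for j in range(n_states))
        matrix.append(
            [counts[(i, j)] / row_total if row_total else 0.0 for j in range(n_states)]
        )
    return matrix


# Hypothetical neighborhood types (0 or 1) for four tracts over three periods.
label_sequences = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
    [1, 0, 0],
]
P = transition_matrix(label_sequences, n_states=2)
```

Entry `P[i][j]` estimates the probability that a neighborhood of type `i` in one period is of type `j` in the next, so each row with observed transitions sums to 1.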
-
class
geosnap.data.data.
DataStore
[source]¶ Bases:
object
Storage for geosnap data. Currently supports US Census data.
- Attributes
codebook
Codebook.
ltdb
Longitudinal Tract Database (LTDB).
msa_definitions
2010 Metropolitan Statistical Area definitions.
ncdb
Geolytics Neighborhood Change Database (NCDB).
Methods
blocks_2000
(self[, states, convert])Census blocks for 2000.
blocks_2010
(self[, states, convert])Census blocks for 2010.
counties
(self[, convert])Nationwide counties as drawn in 2010.
msas
(self[, convert])Metropolitan Statistical Areas as drawn in 2010.
states
(self[, convert])States.
tracts_1990
(self[, states, convert])Nationwide Census Tracts as drawn in 1990 (cartographic 500k).
tracts_2000
(self[, states, convert])Nationwide Census Tracts as drawn in 2000 (cartographic 500k).
tracts_2010
(self[, states, convert])Nationwide Census Tracts as drawn in 2010 (cartographic 500k).
-
blocks_2000
(self, states=None, convert=True)[source]¶ Census blocks for 2000.
- Parameters
- stateslist-like
list of state fips codes to return as a dataframe.
- convertbool
if True, return geodataframe, else return dataframe (the default is True).
- Returns
- pandas.DataFrame or geopandas.GeoDataFrame.
2000 blocks as a geodataframe or as a dataframe with geometry stored as well-known binary on the ‘wkb’ column.
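When `convert` is False, geometries stay encoded as well-known binary (WKB) in the ‘wkb’ column. To show what such a value contains, the sketch below encodes and decodes the simplest WKB geometry, a 2D point, using only the standard library. Census blocks are polygons, whose WKB is longer but follows the same pattern of a byte-order flag, a geometry type code, and coordinates. In practice a geometry library would do this conversion, and the helper names here are hypothetical.

```python
import struct


def encode_wkb_point(x, y):
    """Encode a 2D point as little-endian WKB (21 bytes)."""
    # 1 = little-endian byte order, 1 = geometry type code for Point.
    return struct.pack("<BIdd", 1, 1, x, y)


def decode_wkb_point(wkb):
    """Decode a WKB 2D point into an (x, y) tuple."""
    # The first byte says how the remaining bytes are ordered.
    endian = "<" if wkb[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", wkb[1:21])
    if geom_type != 1:
        raise ValueError("only 2D Point geometries are handled in this sketch")
    return (x, y)


encoded = encode_wkb_point(1.0, 2.0)
decoded = decode_wkb_point(encoded)
```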
-
blocks_2010
(self, states=None, convert=True)[source]¶ Census blocks for 2010.
- Parameters
- stateslist-like
list of state fips codes to return as a dataframe.
- convertbool
if True, return geodataframe, else return dataframe (the default is True).
- Returns
- pandas.DataFrame or geopandas.GeoDataFrame.
2010 blocks as a geodataframe or as a dataframe with geometry stored as well-known binary on the ‘wkb’ column.
-
property
codebook
¶ Codebook.
- Parameters
- None
- Returns
- pandas.DataFrame.
codebook that stores variable names, definitions, and formulas.
-
counties
(self, convert=True)[source]¶ Nationwide counties as drawn in 2010.
- Parameters
- convertbool
if True, return geodataframe, else return dataframe (the default is True).
- Returns
- pandas.DataFrame or geopandas.GeoDataFrame.
2010 counties as a geodataframe or as a dataframe with geometry stored as well-known binary on the ‘wkb’ column.
-
property
ltdb
¶ Longitudinal Tract Database (LTDB).
- Parameters
- None
- Returns
- pandas.DataFrame or geopandas.GeoDataFrame
LTDB as a long-form geo/dataframe
-
property
msa_definitions
¶ 2010 Metropolitan Statistical Area definitions.
- Parameters
- None
- Returns
- pandas.DataFrame.
dataframe that stores state/county –> MSA crosswalk definitions.
-
msas
(self, convert=True)[source]¶ Metropolitan Statistical Areas as drawn in 2010.
- Parameters
- convertbool
if True, return geodataframe, else return dataframe (the default is True).
- Returns
- pandas.DataFrame or geopandas.GeoDataFrame.
2010 MSAs as a geodataframe or as a dataframe with geometry stored as well-known binary on the ‘wkb’ column.
-
property
ncdb
¶ Geolytics Neighborhood Change Database (NCDB).
- Parameters
- None
- Returns
- pandas.DataFrame
NCDB as a long-form dataframe
-
states
(self, convert=True)[source]¶ States.
- Parameters
- convertbool
if True, return geodataframe, else return dataframe (the default is True).
- Returns
- pandas.DataFrame or geopandas.GeoDataFrame.
US States as a geodataframe or as a dataframe with geometry stored as well-known binary on the ‘wkb’ column.
-
tracts_1990
(self, states=None, convert=True)[source]¶ Nationwide Census Tracts as drawn in 1990 (cartographic 500k).
- Parameters
- stateslist-like
list of state fips to subset the national dataframe
- convertbool
if True, return geodataframe, else return dataframe (the default is True).
- Returns
- pandas.DataFrame or geopandas.GeoDataFrame.
1990 tracts as a geodataframe or as a dataframe with geometry stored as well-known binary on the ‘wkb’ column.
-
tracts_2000
(self, states=None, convert=True)[source]¶ Nationwide Census Tracts as drawn in 2000 (cartographic 500k).
- Parameters
- stateslist-like
list of state fips to subset the national dataframe
- convertbool
if True, return geodataframe, else return dataframe (the default is True).
- Returns
- pandas.DataFrame or geopandas.GeoDataFrame.
2000 tracts as a geodataframe or as a dataframe with geometry stored as well-known binary on the ‘wkb’ column.
-
tracts_2010
(self, states=None, convert=True)[source]¶ Nationwide Census Tracts as drawn in 2010 (cartographic 500k).
- Parameters
- stateslist-like
list of state fips to subset the national dataframe
- convertbool
if True, return geodataframe, else return dataframe (the default is True).
- Returns
- pandas.DataFrame or geopandas.GeoDataFrame.
2010 tracts as a geodataframe or as a dataframe with geometry stored as well-known binary on the ‘wkb’ column.
-
geosnap.data.data.
store_blocks_2000
()[source]¶ Save 2000 census block data to the local quilt package storage.
- Parameters
- None
- Returns
- None
Data will be available in the geosnap.data.data_store and will be used in place of streaming data for all census queries.
-
geosnap.data.data.
store_blocks_2010
()[source]¶ Save 2010 census block data to the local quilt package storage.
- Parameters
- None
- Returns
- None
Data will be available in the geosnap.data.data_store and will be used in place of streaming data for all census queries.
-
geosnap.data.data.
store_census
()[source]¶ Save census data to the local quilt package storage.
- Parameters
- None
- Returns
- None
Data will be available in the geosnap.data.data_store and will be used in place of streaming data for all census queries. The raster package is 3.05 GB.
-
geosnap.data.data.
store_ltdb
(sample, fullcount)[source]¶ Read & store data from Brown’s Longitudinal Tract Database (LTDB).
- Parameters
- samplestr
file path of the zip file containing the standard Sample CSV files downloaded from https://s4.ad.brown.edu/projects/diversity/Researcher/LTBDDload/Default.aspx
- fullcount: str
file path of the zip file containing the standard Fullcount CSV files downloaded from https://s4.ad.brown.edu/projects/diversity/Researcher/LTBDDload/Default.aspx
- Returns
- pandas.DataFrame