geosnap.analyze package

Module contents

geosnap.analyze.linc(labels_sequence)

Local Indicator of Neighborhood Change

Returns
lincs: array

local indicator of neighborhood change over all periods

Notes

The local indicator of neighborhood change defined here allows for singleton neighborhoods (i.e., neighborhoods composed of a single primitive area such as a tract or block). This is in contrast to the initial implementation in [RAF+11], which prohibited singletons.

Examples

Time period 0 has the city defined as four neighborhoods on 10 tracts:

>>> labels_0 = [1, 1, 1, 1, 2, 2, 3, 3, 3, 4]

Time period 1 in the same city, with a slight change in the composition of the four neighborhoods:

>>> labels_1 = [1, 1, 1, 1, 1, 2, 3, 3, 3, 4]
>>> res = linc([labels_0, labels_1])
>>> res[4]
1.0
>>> res[1]
0.25
>>> res[7]
0.0
>>> res[-1]
0.0

In period 2, there is no change:

>>> labels_2 = [1, 1, 1, 1, 1, 2, 3, 3, 3, 4]
>>> res = linc([labels_1, labels_2])
>>> res[0]
0.0

We can pass more than two time periods and get a “time-wise global linc” for each unit:

>>> res = linc([labels_0, labels_1, labels_2])
>>> res[0]
0.25
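
The values in the examples above are consistent with a simple set-based reading of the indicator: for each unit, compare the sets of units that share its neighborhood in each period (excluding the unit itself) and take one minus the ratio of the intersection to the union. The following is a minimal sketch under that assumption. `linc_sketch` is a hypothetical name, and the formula is inferred from the doctest output rather than taken from the library source.

```python
def linc_sketch(labels_sequence):
    """Illustrative LINC: 1 - |shared co-members| / |all co-members|."""
    n_units = len(labels_sequence[0])
    lincs = []
    for i in range(n_units):
        member_sets = []
        for labels in labels_sequence:
            # Units that share unit i's label in this period, excluding i itself.
            members = {j for j, lab in enumerate(labels)
                       if lab == labels[i] and j != i}
            member_sets.append(members)
        union = set.union(*member_sets)
        intersection = set.intersection(*member_sets)
        if not union:
            # Unit i is a singleton in every period: treated as no change.
            lincs.append(0.0)
        else:
            lincs.append(1.0 - len(intersection) / len(union))
    return lincs


labels_0 = [1, 1, 1, 1, 2, 2, 3, 3, 3, 4]
labels_1 = [1, 1, 1, 1, 1, 2, 3, 3, 3, 4]
res = linc_sketch([labels_0, labels_1])
print(res[4], res[1], res[7], res[-1])  # 1.0 0.25 0.0 0.0
```

Unit 4 moves from a neighborhood shared with unit 5 to one shared with units 0 to 3, so it keeps none of its co-members and scores 1.0. Unit 1 keeps three of its four co-members and scores 0.25.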
geosnap.analyze.sequence(gdf, cluster_col, seq_clusters=5, subs_mat=None, dist_type=None, indel=None, time_var='year', id_var='geoid')

Pairwise sequence analysis and sequence clustering.

Dynamic programming is used when optimal matching is requested.

Parameters
gdf: (geo)DataFrame

Long-form (geo)DataFrame containing neighborhood attributes with a column defining neighborhood clusters.

cluster_col: string or int

Column name for the neighborhood segmentation, such as “ward”, “kmeans”, etc.

seq_clusters: int, optional

Number of neighborhood sequence clusters. The sequences are clustered using agglomerative clustering with Ward linkage. Default is 5.

dist_type: string

One of the following:

“hamming”: Hamming distance from sklearn.metrics (substitution only, at a constant cost of 1).

“markov”: substitution costs are derived from empirical transition probabilities.

“interval”: substitution costs are the differences between states, with indel=k-1.

“arbitrary”: fixed costs for use when there is no strong theoretical guidance (substitution=0.5, indel=1).

“tran”: transition-oriented optimal matching on the sequence of transitions, based on [Bie11].

subs_mat: array

(k, k) substitution cost matrix. Should be hollow (zero cost between the same type), symmetric and non-negative.

indel: float, optional

Insertion/deletion cost.

time_var: string, optional

Column defining time and/or sequencing of the long-form data. Default is “year”.

id_var: string, optional

Column identifying the unique id of spatial units. Default is “geoid”.
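
The three requirements on subs_mat (hollow, symmetric, non-negative) can be checked before the matrix is passed in. The helper below is a minimal sketch using NumPy. `is_valid_subs_mat` is a hypothetical name and not part of geosnap, and the example matrix uses the 0.5 substitution cost mentioned for the “arbitrary” option.

```python
import numpy as np


def is_valid_subs_mat(subs_mat, k):
    """Return True if subs_mat is a (k, k) hollow, symmetric, non-negative matrix."""
    m = np.asarray(subs_mat, dtype=float)
    return bool(
        m.shape == (k, k)                  # one row and column per cluster type
        and np.allclose(np.diag(m), 0.0)   # hollow: zero cost between the same type
        and np.allclose(m, m.T)            # symmetric: cost(a, b) == cost(b, a)
        and (m >= 0).all()                 # non-negative costs
    )


k = 3
subs_mat = np.full((k, k), 0.5)   # constant substitution cost of 0.5
np.fill_diagonal(subs_mat, 0.0)   # no cost for substituting a type with itself
print(is_valid_subs_mat(subs_mat, k))  # True
```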

Examples

>>> from geosnap.data import Community
>>> from geosnap.analyze import sequence
>>> columbus = Community.from_ltdb(msa_fips=columbusfips)
>>> columbus1 = columbus.cluster(columns=['median_household_income',
... 'p_poverty_rate', 'p_edu_college_greater', 'p_unemployment_rate'],
... method='ward', n_clusters=6)
>>> gdf = columbus1.gdf
>>> gdf_new, df_wide, seq_hamming = sequence(gdf, "ward", dist_type="hamming")
>>> seq_hamming.seq_dis_mat[:5, :5]
array([[0., 3., 4., 5., 5.],
       [3., 0., 3., 3., 3.],
       [4., 3., 0., 2., 2.],
       [5., 3., 2., 0., 0.],
       [5., 3., 2., 0., 0.]])
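
To make the “hamming” option concrete, the sketch below counts position-wise mismatches between every pair of label sequences, which is what a substitution-only distance with constant cost 1 amounts to. It uses plain NumPy rather than sklearn.metrics, the toy sequences are made up for illustration, and `hamming_distance_matrix` is a hypothetical helper, not a geosnap function.

```python
import numpy as np


def hamming_distance_matrix(sequences):
    """Pairwise count of positions at which two equal-length sequences differ."""
    seqs = np.asarray(sequences)
    # Broadcast rows against each other and count mismatches per pair.
    return (seqs[:, None, :] != seqs[None, :, :]).sum(axis=2).astype(float)


# Wide-form toy data: one row per spatial unit, one column per time period.
sequences = [
    [0, 0, 1, 1, 2],
    [0, 1, 1, 1, 2],
    [2, 2, 2, 1, 0],
]
print(hamming_distance_matrix(sequences))
# [[0. 1. 4.]
#  [1. 0. 4.]
#  [4. 4. 0.]]
```

The result is hollow and symmetric, like the slice of seq_dis_mat shown above.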
geosnap.analyze.transition(gdf, cluster_col, time_var='year', id_var='geoid', w_type=None, permutations=0)

(Spatial) Markov approach to transitional dynamics of neighborhoods.

Parameters
gdf: (geo)DataFrame

Long-form (geo)DataFrame containing neighborhood attributes with a column defining neighborhood clusters.

cluster_col: string or int

Column name for the neighborhood segmentation, such as “ward”, “kmeans”, etc.

time_var: string, optional

Column defining time and/or sequencing of the long-form data. Default is “year”.

id_var: string, optional

Column identifying the unique id of spatial units. Default is “geoid”.

w_type: string, optional

Type of spatial weights (“rook”, “queen”, “knn” or “kernel”) used to define spatial structure. Default is None, which yields non-spatial Markov transition rates.

permutations: int, optional

Number of permutations to use in randomization-based inference (the default is 0).

Examples

>>> from geosnap.data import Community
>>> from geosnap.analyze import transition
>>> columbus = Community.from_ltdb(msa_fips=columbusfips)
>>> columbus1 = columbus.cluster(columns=['median_household_income',
... 'p_poverty_rate', 'p_edu_college_greater', 'p_unemployment_rate'],
... method='ward', n_clusters=6)
>>> gdf = columbus1.gdf
>>> a = transition(gdf, "ward", w_type="rook")
>>> a.p
array([[0.79189189, 0.00540541, 0.0027027 , 0.13243243, 0.06216216,
        0.00540541],
       [0.0203252 , 0.75609756, 0.10569106, 0.11382114, 0.        ,
        0.00406504],
       [0.00917431, 0.20183486, 0.75229358, 0.01834862, 0.        ,
        0.01834862],
       [0.1959799 , 0.18341709, 0.00251256, 0.61809045, 0.        ,
        0.        ],
       [0.32307692, 0.        , 0.        , 0.        , 0.66153846,
        0.01538462],
       [0.09375   , 0.0625    , 0.        , 0.        , 0.        ,
        0.84375   ]])
>>> a.P[0]
array([[0.82119205, 0.        , 0.        , 0.10927152, 0.06622517,
        0.00331126],
       [0.14285714, 0.57142857, 0.14285714, 0.14285714, 0.        ,
        0.        ],
       [0.5       , 0.        , 0.5       , 0.        , 0.        ,
        0.        ],
       [0.21428571, 0.14285714, 0.        , 0.64285714, 0.        ,
        0.        ],
       [0.18918919, 0.        , 0.        , 0.        , 0.78378378,
        0.02702703],
       [0.28571429, 0.        , 0.        , 0.        , 0.        ,
        0.71428571]])
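
For intuition about the non-spatial case, the sketch below estimates a first-order transition probability matrix directly from long-form data: pivot to one row per unit, count moves between consecutive periods, and normalize each row. It is an illustration only, assuming a balanced panel (every unit observed in every period). `markov_transition_sketch` and the toy data are hypothetical, and the library's own estimator may differ.

```python
import numpy as np
import pandas as pd


def markov_transition_sketch(df, cluster_col, time_var="year", id_var="geoid"):
    """Row-normalized counts of cluster-to-cluster moves between consecutive periods."""
    # Wide form: one row per unit, one column per period, ordered in time.
    wide = df.pivot(index=id_var, columns=time_var, values=cluster_col).sort_index(axis=1)
    states = sorted(df[cluster_col].unique())
    position = {state: i for i, state in enumerate(states)}
    counts = np.zeros((len(states), len(states)))
    periods = list(wide.columns)
    for t0, t1 in zip(periods[:-1], periods[1:]):
        for origin, destination in zip(wide[t0], wide[t1]):
            counts[position[origin], position[destination]] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Rows with no observed origin are left as zeros instead of dividing by zero.
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)


toy = pd.DataFrame({
    "geoid": ["a", "a", "a", "b", "b", "b", "c", "c", "c"],
    "year": [2000, 2010, 2020] * 3,
    "ward": [0, 0, 1, 0, 1, 1, 1, 1, 1],
})
p = markov_transition_sketch(toy, "ward")
print(p)
```

In the toy data, units starting in cluster 0 stay once and leave twice, so the first row is one third and two thirds. Units in cluster 1 never leave, so the second row is 0 and 1.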
geosnap.analyze.cluster(gdf, n_clusters=6, method=None, best_model=False, columns=None, verbose=False, time_var='year', id_var='geoid', return_model=False, scaler=None, **kwargs)

Create a geodemographic typology by running a cluster analysis on the study area’s neighborhood attributes.

Parameters
gdf: pandas.DataFrame

long-form (geo)DataFrame containing neighborhood attributes

n_clusters: int

the number of clusters to model (the default is 6).

method: str

the clustering algorithm used to identify neighborhood types

best_model: bool

if using a Gaussian mixture model, use BIC to choose the best n_clusters (the default is False).

columns: list-like

subset of columns on which to apply the clustering

verbose: bool

whether to print warning messages (the default is False).

time_var: str

which column on the dataframe defines time and/or sequencing of the long-form data. Default is “year”.

id_var: str

which column on the long-form dataframe identifies the stable units over time. In a wide-form dataset, this would be the unique index.

scaler: str or sklearn.preprocessing.Scaler

a scikit-learn preprocessing class that will be used to rescale the data. Defaults to StandardScaler

Returns
pandas.DataFrame with a column of neighborhood cluster labels appended
as a new column. Will overwrite columns of the same name.
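
The general pattern described here (rescale the selected columns, fit a clustering model, append the labels as a new column) can be sketched with scikit-learn as follows. This is a simplified stand-in rather than geosnap's implementation: it uses KMeans, pools all rows regardless of year, and runs on randomly generated toy data. `cluster_sketch` and the `label_col` argument are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_sketch(df, columns, n_clusters=6, label_col="kmeans"):
    """Standardize `columns`, fit KMeans, and append the labels as `label_col`."""
    out = df.copy()
    # Rescale so that no single attribute dominates the distance computation.
    scaled = StandardScaler().fit_transform(out[columns])
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    # An existing column named `label_col` would be overwritten here.
    out[label_col] = model.fit_predict(scaled)
    return out


# Toy long-form data: 30 units observed in two years (values are random).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "geoid": np.repeat(np.arange(30), 2),
    "year": np.tile([2000, 2010], 30),
    "median_household_income": rng.normal(50000, 15000, 60),
    "p_poverty_rate": rng.uniform(0, 40, 60),
})
labeled = cluster_sketch(
    toy, ["median_household_income", "p_poverty_rate"], n_clusters=3
)
print(labeled[["geoid", "year", "kmeans"]].head())
```

Because each row is a unit in a given year, a unit can receive different labels in different years, which is what the sequence and transition analyses above operate on.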
geosnap.analyze.cluster_spatial(gdf, n_clusters=6, spatial_weights='rook', method=None, columns=None, threshold_variable='count', threshold=10, time_var='year', id_var='geoid', return_model=False, scaler=None, **kwargs)

Create a spatial geodemographic typology by running a cluster analysis on the metro area’s neighborhood attributes and including a contiguity constraint.

Parameters
gdf: geopandas.GeoDataFrame

long-form geodataframe holding neighborhood attribute and geometry data.

n_clusters: int

the number of clusters to model (the default is 6).

spatial_weights: str, ‘queen’ or ‘rook’

spatial weights matrix specification (the default is “rook”).

method: str

the clustering algorithm used to identify neighborhood types

columns: list-like

subset of columns on which to apply the clustering

threshold_variable: str

for max-p, which variable should define p. The default is “count”, which will grow regions until the threshold number of polygons have been aggregated.

threshold: numeric

threshold to use for max-p clustering (the default is 10).

time_var: str

which column on the dataframe defines time and/or sequencing of the long-form data. Default is “year”.

id_var: str

which column on the long-form dataframe identifies the stable units over time. In a wide-form dataset, this would be the unique index

scaler: str or sklearn.preprocessing.Scaler

a scikit-learn preprocessing class that will be used to rescale the data. Defaults to StandardScaler

Returns
geopandas.GeoDataFrame with a column of neighborhood cluster labels
appended as a new column. Will overwrite columns of the same name.
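
To illustrate what a contiguity constraint does, the sketch below clusters cells of a regular grid with scikit-learn's AgglomerativeClustering and a rook-style adjacency graph, then checks that every resulting cluster forms a single connected region. It is an analogy under simplifying assumptions, not geosnap's implementation: the grid stands in for tract polygons, the attributes are random, and the adjacency comes from `grid_to_graph` rather than from geometries.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

side = 6
rng = np.random.default_rng(0)
attributes = rng.normal(size=(side * side, 2))  # two made-up attributes per cell

# Rook-style adjacency on a regular grid: cells are linked when they share an edge.
connectivity = grid_to_graph(side, side).tocsr()

# Ward linkage restricted to merges between adjacent clusters.
model = AgglomerativeClustering(n_clusters=4, linkage="ward", connectivity=connectivity)
labels = model.fit_predict(attributes)


def is_contiguous(label):
    """True if the cells carrying `label` form one connected component."""
    idx = np.flatnonzero(labels == label)
    n_parts, _ = connected_components(connectivity[idx][:, idx], directed=False)
    return n_parts == 1


print(all(is_contiguous(label) for label in np.unique(labels)))
```

Without the `connectivity` argument, the same model would group cells purely by attribute similarity, and clusters could be scattered across the grid.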