# report python version and install required packages
# switch if statement from True to False once packages have been installed
if True:
    !python -V
    !pip install -r ../environment/requirements.txt

Python 3.9.18
Requirement already satisfied: pandas==1.4.4 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 1)) (1.4.4)
Requirement already satisfied: geopandas==0.13.2 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 2)) (0.13.2)
Requirement already satisfied: pyhere==1.0.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 3)) (1.0.0)
Requirement already satisfied: scipy==1.10.1 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 4)) (1.10.1)
Requirement already satisfied: numpy==1.21.5 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 5)) (1.21.5)
Requirement already satisfied: MDP==3.5 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 6)) (3.5)
Requirement already satisfied: pygris==0.1.5 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 7)) (0.1.5)
Requirement already satisfied: libpysal==4.7.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 8)) (4.7.0)
Requirement already satisfied: lxml==4.9.3 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 9)) (4.9.3)
Requirement already satisfied: tabulate==0.9.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 10)) (0.9.0)
Requirement already satisfied: matplotlib==3.7.1 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 11)) (3.7.1)
Requirement already satisfied: mapclassify==2.5.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 12)) (2.5.0)
Requirement already satisfied: folium in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from -r ../environment/requirements.txt (line 13)) (0.14.0)
Requirement already satisfied: python-dateutil>=2.8.1 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from pandas==1.4.4->-r ../environment/requirements.txt (line 1)) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from pandas==1.4.4->-r ../environment/requirements.txt (line 1)) (2023.3.post1)
Requirement already satisfied: fiona>=1.8.19 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (1.9.5)
Requirement already satisfied: packaging in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (23.1)
Requirement already satisfied: pyproj>=3.0.1 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (3.6.1)
Requirement already satisfied: shapely>=1.7.1 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (2.0.2)
Requirement already satisfied: future in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from MDP==3.5->-r ../environment/requirements.txt (line 6)) (0.18.3)
Requirement already satisfied: requests in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from pygris==0.1.5->-r ../environment/requirements.txt (line 7)) (2.31.0)
Requirement already satisfied: appdirs in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from pygris==0.1.5->-r ../environment/requirements.txt (line 7)) (1.4.4)
Requirement already satisfied: pip in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from pygris==0.1.5->-r ../environment/requirements.txt (line 7)) (23.3)
Requirement already satisfied: rtree in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from pygris==0.1.5->-r ../environment/requirements.txt (line 7)) (1.1.0)
Requirement already satisfied: platformdirs in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from libpysal==4.7.0->-r ../environment/requirements.txt (line 8)) (3.10.0)
Requirement already satisfied: beautifulsoup4 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from libpysal==4.7.0->-r ../environment/requirements.txt (line 8)) (4.12.2)
Requirement already satisfied: jinja2 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from libpysal==4.7.0->-r ../environment/requirements.txt (line 8)) (3.1.2)
Requirement already satisfied: contourpy>=1.0.1 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from matplotlib==3.7.1->-r ../environment/requirements.txt (line 11)) (1.1.1)
Requirement already satisfied: cycler>=0.10 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from matplotlib==3.7.1->-r ../environment/requirements.txt (line 11)) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from matplotlib==3.7.1->-r ../environment/requirements.txt (line 11)) (4.43.1)
Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from matplotlib==3.7.1->-r ../environment/requirements.txt (line 11)) (1.4.5)
Requirement already satisfied: pillow>=6.2.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from matplotlib==3.7.1->-r ../environment/requirements.txt (line 11)) (10.1.0)
Requirement already satisfied: pyparsing>=2.3.1 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from matplotlib==3.7.1->-r ../environment/requirements.txt (line 11)) (3.1.1)
Requirement already satisfied: importlib-resources>=3.2.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from matplotlib==3.7.1->-r ../environment/requirements.txt (line 11)) (6.1.0)
Requirement already satisfied: scikit-learn in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from mapclassify==2.5.0->-r ../environment/requirements.txt (line 12)) (1.3.2)
Requirement already satisfied: networkx in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from mapclassify==2.5.0->-r ../environment/requirements.txt (line 12)) (3.2)
Requirement already satisfied: branca>=0.6.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from folium->-r ../environment/requirements.txt (line 13)) (0.6.0)
Requirement already satisfied: attrs>=19.2.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from fiona>=1.8.19->geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (23.1.0)
Requirement already satisfied: certifi in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from fiona>=1.8.19->geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (2023.7.22)
Requirement already satisfied: click~=8.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from fiona>=1.8.19->geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (8.1.7)
Requirement already satisfied: click-plugins>=1.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from fiona>=1.8.19->geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (1.1.1)
Requirement already satisfied: cligj>=0.5 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from fiona>=1.8.19->geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (0.7.2)
Requirement already satisfied: six in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from fiona>=1.8.19->geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (1.16.0)
Requirement already satisfied: setuptools in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from fiona>=1.8.19->geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (68.0.0)
Requirement already satisfied: importlib-metadata in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from fiona>=1.8.19->geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (6.0.0)
Requirement already satisfied: zipp>=3.1.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from importlib-resources>=3.2.0->matplotlib==3.7.1->-r ../environment/requirements.txt (line 11)) (3.11.0)
Requirement already satisfied: MarkupSafe>=2.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from jinja2->libpysal==4.7.0->-r ../environment/requirements.txt (line 8)) (2.1.1)
Requirement already satisfied: soupsieve>1.2 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from beautifulsoup4->libpysal==4.7.0->-r ../environment/requirements.txt (line 8)) (2.5)
Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from requests->pygris==0.1.5->-r ../environment/requirements.txt (line 7)) (2.0.4)
Requirement already satisfied: idna<4,>=2.5 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from requests->pygris==0.1.5->-r ../environment/requirements.txt (line 7)) (3.4)
Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from requests->pygris==0.1.5->-r ../environment/requirements.txt (line 7)) (1.26.18)
Requirement already satisfied: joblib>=1.1.1 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from scikit-learn->mapclassify==2.5.0->-r ../environment/requirements.txt (line 12)) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from scikit-learn->mapclassify==2.5.0->-r ../environment/requirements.txt (line 12)) (3.2.0)
Requirement already satisfied: colorama in c:\users\gshanleybarr\.conda\envs\spielmastudy\lib\site-packages (from click~=8.0->fiona>=1.8.19->geopandas==0.13.2->-r ../environment/requirements.txt (line 2)) (0.4.6)


# Import modules, define directories
import pygris
import pandas as pd
import geopandas as gpd
from pygris.data import get_census
from pygris import counties
from pyhere import here
import numpy as np
import libpysal as lps
import lxml
import tabulate
from scipy.stats import spearmanr
from scipy.stats.mstats import zscore as ZSCORE
from scipy.stats import rankdata
import mdp as MDP
from operator import itemgetter
import copy
from matplotlib.colors import ListedColormap
from matplotlib import patheffects as pe
import matplotlib.pyplot as plt
from IPython import display
from IPython.display import Markdown, Latex
import folium 

pd.set_option("chained_assignment", None)

path = {
    "dscr": here("data", "scratch"),
    "drpub": here("data", "raw", "public", "spielman", "input"),
    "drpub2": here("data", "raw", "public"),
    "drpriv": here("data", "raw", "private"),
    "ddpub": here("data", "derived", "public", "version1"),
    "ddpriv": here("data", "derived", "private"),
    "rfig": here("results", "figures"),
    "roth": here("results", "other"),
    "rtab": here("results", "tables"),
    "og_out": here("data", "raw", "public", "spielman", "output"),
    "dmet": here("data", "metadata")
}


# Switch from False to True to regenerate documentation of computational environment
# Note that this approach is not perfect -- it may miss some packages
# This code may work better from the command prompt
if False:
    !pip install pigar
    !pigar generate -f ../environment/requirements.txt


# Import data dictionary
acs_vars = pd.read_csv( here("data", "metadata", "ACS_2012_data_dictionary.csv") )
acs_vars.drop(columns=acs_vars.columns[0], axis=1, inplace=True)

acs_variables = list(acs_vars['Reproduction Label'][1:])
spielman_acs_variables = list(acs_vars['Spielman Label'][1:])


# Switch from False to True to download fresh data from the Census
if False:
    # Acquire attribute data for reproduction 
    counties_detailed = get_census(dataset = "acs/acs5", # dataset name on the Census API you are connecting to; find datasets at https://api.census.gov/data.html
                            variables = acs_variables, # string (or list of strings) of desired vars. For the 2021 5-year ACS Data Profile, those variable IDs are found at https://api.census.gov/data/2021/acs/acs5/profile/variables.html
                            year = 2012, # year of your data (or end-year for a 5-year ACS sample)
                            params = { # dict of query parameters to send to the API.
                              "for": "county:*"},
                            guess_dtypes = True,
                            return_geoid = True)

    # Drop Puerto Rico
    counties_detailed = counties_detailed.loc[~counties_detailed['GEOID'].str.startswith('72')]

    # Download and save raw data
    counties_detailed.to_csv( here(path["drpub2"], "counties_attributes_raw.csv"))
else:
    counties_detailed = pd.read_csv( here(path["drpub2"], "counties_attributes_raw.csv"), dtype = {'GEOID': object} )
    counties_detailed = counties_detailed.drop(counties_detailed.columns[0],axis=1)


# Import original data from Spielman et al.

# Import base ACS data
make_strings = {'Geo_FIPS': object, 'Geo_STATE': object, 'Geo_COUNTY': object,
                'Geo_TRACT': object, 'Geo_CBSA': object, 'Geo_CSA': object}

acs = pd.read_csv(here(path["drpub"], 'sovi_acs.csv'),
                  dtype=make_strings, skiprows=1,encoding='latin-1')

# Import, join an ACS supplemental
acs_sup2 = pd.read_csv(here(path["drpub"], 'sovi_acs_kids.csv'),
                           dtype=make_strings, skiprows=1,encoding='latin-1')

acs = acs.merge(acs_sup2, how = "inner", on='Geo_FIPS')

# Drop Puerto Rico
acs = acs[acs.Geo_STATE_x != '72']


Markdown( here(path["dmet"], "ACS_2012_geographic_metadata.md") )


acs_vars


# Import decennial supplemental
dec_sup1 = pd.read_csv(here(path["drpub"],'sovi_decennial_sup1.csv'),
        dtype=make_strings,skiprows=1,encoding='latin-1')


Markdown( here(path["dmet"], "dec_2010_metadata.md") )


spielman_geom = gpd.read_file( here(path["drpub"], "USA_Counties_500k.shp") )


Markdown( here(path["dmet"], 'USA_counties_metadata.md') )


# Switch False to True if you wish to acquire data directly from census
if False:
    # Acquire geographical data for reproduction
    counties_shp = counties(cb = True, year = 2010, cache = True) # year 2012 (and 2011) cartographic boundaries not available

    # Save raw data
    counties_shp.to_file( here(path["drpub2"], "counties_geometries_raw.gpkg") )
else:
    counties_shp = gpd.read_file( here(path["drpub2"], "counties_geometries_raw.gpkg") )


Markdown( here(path["dmet"], "county_geom_2010_metadata.md") )


# Step P1
# Join geometry and attribute data for reproduction 
counties_shp['GEOID'] = counties_shp.STATE + counties_shp.COUNTY
counties = counties_shp.merge(counties_detailed, how = "inner", on = "GEOID")

# Also join Spielman's land area information to the rest of Spielman's data
# (to check that all data is accurate, not for purposes of analysis)
acs = acs.merge(dec_sup1, how = "inner", on='Geo_FIPS')


# Define a function that can determine whether every entry in specified columns of two tables match  
def equiv(table1, sort1, column1, table2, sort2, column2):
    ''' 
    Tests two tables to see whether corresponding columns have equivalent entries.
    
    Parameters:
    table1 - the first table
    sort1 - the column in the first table to join by (str)
    column1 - the column(s) in the first table to test the values of (list of str) (should list analogous columns for columns2) 
    table2 - the second table
    sort2 - the column in the second table to join by (str)
    column2 - the column(s) in the second table to test the values of (list of str)
    '''
    # Sort tables
    table1 = table1.sort_values(by = sort1).reset_index()
    table2 = table2.sort_values(by = sort2).reset_index()
    
    # Rename column name in table2 to match that in table1
    for i in range(len(column1)):
        table2 = table2.rename(columns={column2[i]: column1[i]})

    # Select the columns to test equivalency of
    table1 = table1[column1]
    table2 = table2[column1]
    
    # Perform equivalency test
    test = table1.eq(table2)
    return test


# # Round area column
# acs['SE_T02A_002_check'] = acs.SE_T02A_002.round(0)
# counties['CENSUSAREA_check'] = counties.CENSUSAREA.round(0)

# Add the area variables to the lists of variables
acs_variables.append('CENSUSAREA')
spielman_acs_variables.append('SE_T02A_002')


# Perform equivalency test
test = equiv(counties, "GEOID", acs_variables, acs, "Geo_FIPS", spielman_acs_variables)
matching_cols = pd.DataFrame({"test": test.sum().eq(3143)}) # 3143 matches the number of rows 
matching_cols.loc[~matching_cols.test] # Identify the columns that have some data discrepencies


# Inspect the data values for B25077_001E at the indices with data discrepancies

# Find the rows for which there are data discrepancies
messed_up_indices = test[["B25077_001E"]].loc[~test.B25077_001E]

# Select the data of interest from tygris
tygris_data = counties.sort_values(by = "GEOID")\
    .reset_index().loc[messed_up_indices.index]\
    [["GEOID", "B25077_001E"]]

# Select the data of interest from Spielman et al.
spielman_data = acs.sort_values(by = "Geo_FIPS")\
    .reset_index().loc[messed_up_indices.index]\
    [["Geo_FIPS", "ACS12_5yr_B25077001"]]

# Join and inspect
merged = tygris_data.merge(spielman_data, how = "inner", left_on = "GEOID", right_on = "Geo_FIPS")
merged


# Inspect the data values for CENSUSAREA at the indices with data discrepancies

# Find the rows for which there are data discrepancies
messed_up_indices = test[["CENSUSAREA"]].loc[~test.CENSUSAREA]

# Select the data of interest from tygris
tygris_data = counties.sort_values(by = "GEOID")\
    .reset_index().loc[messed_up_indices.index]\
    [["GEOID", "CENSUSAREA"]]

# Select the data of interest from Spielman et al.
spielman_data = acs.sort_values(by = "Geo_FIPS")\
    .reset_index().loc[messed_up_indices.index]\
    [["Geo_FIPS", "SE_T02A_002"]]

# Join and inspect
merged = tygris_data.merge(spielman_data, how = "inner", left_on = "GEOID", right_on = "Geo_FIPS")
merged


merged["Difference"] = abs(merged["CENSUSAREA"] - merged["SE_T02A_002"])
print("Maximum difference:", merged['Difference'].max(), "\nAverage difference:", merged['Difference'].mean())

Maximum difference: 0.010999999998603016 
Average difference: 0.00031027379912322995


# Step P2
# Calculating the variables used in SoVI
counties['MEDAGE_ACS'] = counties.B01002_001E
counties['BLACK_ACS'] = counties.B03002_004E / (counties.B03002_001E)
counties['QNATAM_ACS'] = counties.B03002_005E / (counties.B03002_001E)
counties['QASIAN_ACS'] = counties.B03002_006E / (counties.B03002_001E)
counties['QHISP_ACS'] = counties.B03002_012E / (counties.B03002_001E)
counties['QAGEDEP_ACS'] = (counties.B06001_002E + counties.B09020_001E) / (counties.B01003_001E)
counties['QPUNIT_ACS'] = counties.B25008_001E / (counties.B25002_002E)
counties['PRENTER_ACS'] = counties.B25003_003E / (counties.B25002_001E)
counties['QNRRES_ACS'] = counties.B09020_021E / (counties.B01003_001E)
counties['QFEMALE_ACS'] = counties.B01001_026E / (counties.B01003_001E)
counties['QFHH_ACS'] = counties.B11001_006E / (counties.B11001_001E)
counties['QUNOCCHU_ACS'] = counties.B25002_003E / (counties.B25002_001E)
counties['QCVLUN'] = (counties.B23022_025E + counties.B23022_049E) / \
                counties.B23022_001E
counties['QPOVTY'] = (counties.B17021_002E) / counties.B17021_001E
counties['QMOHO'] = (counties.B25024_010E) / counties.B25024_001E
counties['QFEMLBR'] = (counties.C24010_038E) / counties.C24010_001E
counties['QSSBEN'] = (counties.B19055_002E) / counties.B19055_001E
counties['QFAM'] = (counties.B09002_002E) / counties.B09002_001E
counties['QRICH200K'] = (counties.B19001_017E) / counties.B11001_001E
counties['PERCAP_ALT'] = counties.B19025_001E / (counties.B25008_001E)
counties['QESL_ALT'] = (counties.B06007_005E + counties.B06007_008E) / \
                  counties.B06007_001E
counties['QED12LES_ALT'] = (counties.B16010_002E) / counties.B16010_001E 
counties['QEXTRCT_ALT'] = (counties.C24050_002E) / counties.C24050_001E 
counties['QSERV_ALT'] = (counties.C24050_029E) / counties.C24050_001E 
counties['QNOAUTO_ALT'] = (counties.B08201_002E) / counties.B08201_001E 
counties['MDGRENT_ALT'] = counties.B25064_001E 
counties['MHSEVAL_ALT'] = counties.B25077_001E 
counties['POPDENS'] = counties.B01003_001E / (counties.CENSUSAREA)


# Check for missing data
for i in counties.columns:
    x = counties[i].isnull().sum()
    if x > 0:
        print(i, "contains", x, "missing value(s).")
        
# Check for infinities
counties_num = counties.select_dtypes(include=['int64','float64'])
for i in counties_num.columns:
    xmin = counties_num[i].min()
    xmax = counties_num[i].max()
    if xmin == -np.inf:
        print(i, "contains a negative infinity")
    elif xmax == np.inf:
        print(i, "contains a positive infinity")

LSAD contains 2 missing value(s).
B25077_001E contains 1 missing value(s).
QFAM contains 2 missing value(s).
MHSEVAL_ALT contains 1 missing value(s).


# Step P3
# Replace missing QFAM data with 0
counties.QFAM = counties.QFAM.replace([np.inf, -np.inf, np.nan], 0)

# Replace missing MHSEVAL_ALT data with its spatial lag

# Calculate spatial weights matrix
w = lps.weights.Queen.from_dataframe(counties) 
w.transform = 'R'
# Calculate spatial lag
counties['MHSEVAL_ALT_LAG'] = lps.weights.lag_spatial(w, counties.MHSEVAL_ALT)
# Impute for the missing value
counties.MHSEVAL_ALT[np.isnan(counties['MHSEVAL_ALT'])] = counties[["MHSEVAL_ALT_LAG"]][pd.isna(counties['MHSEVAL_ALT'])]

('WARNING: ', 68, ' is an island (no neighbors)')
('WARNING: ', 546, ' is an island (no neighbors)')
('WARNING: ', 547, ' is an island (no neighbors)')
('WARNING: ', 549, ' is an island (no neighbors)')
('WARNING: ', 1226, ' is an island (no neighbors)')
('WARNING: ', 1876, ' is an island (no neighbors)')
('WARNING: ', 2976, ' is an island (no neighbors)')

C:\Users\gshanleybarr\.conda\envs\SpielmaStudy\lib\site-packages\libpysal\weights\weights.py:172: UserWarning: The weights matrix is not fully connected: 
 There are 10 disconnected components.
 There are 7 islands with ids: 68, 546, 547, 549, 1226, 1876, 2976.
  warnings.warn(message)


# Import Spielman et al.'s derived variables
US_All = pd.read_csv(here("data", "raw", "public", "spielman", "output", "sovi_inputs.csv"))
counties.to_csv(here(path["ddpub"],'counties.csv'))  
counties.to_file(here(path["ddpub"],'counties.gpkg'))  
counties = pd.read_csv(here(path["ddpub"], "counties.csv"), dtype = {'GEOID': object})


# Select only the relevant columns

# Attribute name and expected influence on vulnerability
input_names = [['MEDAGE_ACS', 'pos', 'person', 'Median Age'],
               ['BLACK_ACS', 'pos', 'person', 'Pop African-American (%)'],
               ['QNATAM_ACS', 'pos', 'person', 'Pop Native American (%)'],
               ['QASIAN_ACS', 'pos', 'person', 'Pop Asian (%)'],
               ['QHISP_ACS', 'pos', 'person', 'Pop Hispanic (%)'],
               ['QAGEDEP_ACS', 'pos', 'person', 'Age Dependency (%)'],
               ['QPUNIT_ACS', 'pos', 'person', 'Persons Per Housing Unit'],
               ['PRENTER_ACS', 'pos', 'hu', 'Rental Housing (%)'],
               ['QNRRES_ACS', 'pos', 'person', 'Nursing Home Residents (%)'],
               ['QFEMALE_ACS', 'pos', 'person', 'Pop Female (%)'],
               ['QFHH_ACS', 'pos', 'hu', 'Female-Headed Households (%)'],
               ['QUNOCCHU_ACS', 'pos', 'hu', 'Vacant Housing (%)'],
               ['PERCAP_ALT', 'neg', 'person', 'Per-Capita Income'],
               ['QESL_ALT', 'pos', 'person', 'English as Second Language (%)'],
               ['QCVLUN', 'pos', 'person', 'Unemployment (%)'],
               ['QPOVTY', 'pos', 'person', 'Poverty (%)'],
               ['QMOHO', 'pos', 'hu', 'Mobile Homes (%)'],
               ['QED12LES_ALT', 'pos', 'person',
                   'Adults Completed <Grade 12 (%)'],
               ['QFEMLBR', 'pos', 'person', 'Female Employment (%)'],
               ['QEXTRCT_ALT', 'pos', 'person',
                   'Extractive Sector Employment (%)'],
               ['QSERV_ALT', 'pos', 'person', 'Service Sector Employment (%)'],
               ['QSSBEN', 'pos', 'hu', 'Social Security Income (%)'],
               ['QNOAUTO_ALT', 'pos', 'hu', 'No Automobile (%)'],
               ['QFAM', 'neg', 'person', 'Children in Married Families (%)'],
               ['QRICH200K', 'neg', 'hu', 'Annual Income >$200K (%)'],
               ['MDGRENT_ALT', 'neg', 'hu', 'Median Rent'],
               ['MHSEVAL_ALT', 'neg', 'hu', 'Median Home Value'],
               ['POPDENS', 'pos', 'person', 'Population Density']]

# Get attribute names
attr_names1 = [j[0] for j in input_names] + ['GEOID']
attr_names2 = [j[0] for j in input_names] + ['Geo_FIPS']

# Select only the columns needed to compute SoVI
counties = counties[attr_names1]
US_All = US_All[attr_names2]

counties["GEOID"] = "g" + counties["GEOID"]
counties['stateID'] = counties.GEOID.str.slice(0, 3, 1)
attr_names1.remove('GEOID')
counties = counties.set_index(counties["GEOID"]).sort_index()

US_All['stateID'] = US_All.Geo_FIPS.str.slice(0, 3, 1)
attr_names2.remove('Geo_FIPS')
US_All = US_All.set_index(US_All["Geo_FIPS"]).sort_index()


counties.eq(US_All).sum()

BLACK_ACS       3143
GEOID              0
Geo_FIPS           0
MDGRENT_ALT     3143
MEDAGE_ACS      3143
MHSEVAL_ALT     3143
PERCAP_ALT      3143
POPDENS          853
PRENTER_ACS     3143
QAGEDEP_ACS     3143
QASIAN_ACS      3143
QCVLUN          3143
QED12LES_ALT    3143
QESL_ALT        3143
QEXTRCT_ALT     3143
QFAM            3143
QFEMALE_ACS     3143
QFEMLBR         3143
QFHH_ACS        3143
QHISP_ACS       3143
QMOHO           3143
QNATAM_ACS      3143
QNOAUTO_ALT     3143
QNRRES_ACS      3143
QPOVTY          3143
QPUNIT_ACS      3143
QRICH200K       3143
QSERV_ALT       3143
QSSBEN          3143
QUNOCCHU_ACS    3143
stateID         3143
dtype: int64


diff = (abs(counties[["POPDENS"]] - US_All[["POPDENS"]]))
print("Maximum difference:", diff.max()[0], "\nAverage difference:", diff.mean()[0])

Maximum difference: 0.949787960271351 
Average difference: 0.0015009727346509663


# Step P4
# Flip signs as needed to ensure that each variable contributes as expected to the final Sovi
for name, sign, sample, hrname in input_names:
    if sign == 'neg':
        counties[name] = -counties[name].values
        print("Inverting variable:", name)
    elif sign == 'pos':
        pass
    else:
        print("problem in flipping signs")
        raise

Inverting variable: PERCAP_ALT
Inverting variable: QFAM
Inverting variable: QRICH200K
Inverting variable: MDGRENT_ALT
Inverting variable: MHSEVAL_ALT


class SPSS_PCA:
	'''
	A class that integrates most (all?) of the assumptions SPSS imbeds in their
    implimnetation of principal components analysis (PCA), which can be found in
    thier GUI under Analyze > Dimension Reduction > Factor. This class is not
	intended to be a full blown recreation of the SPSS Factor Analysis GUI, but
	it does replicate (possibly) the most common use cases. Note that this class
	will not produce exactly the same results as SPSS, probably due to differences
	in how eigenvectors/eigenvalues and/or singular values are computed. However,
	this class does seem to get all the signs to match, which is not really necessary
	but kinda nice. Most of the approach came from the official SPSS documentation.

	References
	----------
	ftp://public.dhe.ibm.com/software/analytics/spss/documentation/statistics/20.0/en/client/Manuals/IBM_SPSS_Statistics_Algorithms.pdf
	http://spssx-discussion.1045642.n5.nabble.com/Interpretation-of-PCA-td1074350.html
	http://mdp-toolkit.sourceforge.net/api/mdp.nodes.WhiteningNode-class.html
	https://github.com/mdp-toolkit/mdp-toolkit/blob/master/mdp/nodes/pca_nodes.py

	Parameters
	----------
	inputs:  numpy array
			 n x k numpy array; n observations and k variables on each observation
	reduce:  boolean (default=False)
			 If True, then use eigenvalues to determine which factors to keep; all
			 results will be based on just these factors. If False use all factors.
	min_eig: float (default=1.0)
			 If reduce=True, then keep all factors with an eigenvalue greater than
			 min_eig. SPSS default is 1.0. If reduce=False, then min_eig is ignored.
	varimax: boolean (default=False)
			 If True, then apply a varimax rotation to the results. If False, then
			 return the unrotated results only.

	Attributes
	----------
	z_inputs:	numpy array
				z-scores of the input array.
	comp_mat:	numpy array
				Component matrix (a.k.a, "loadings").
	scores:		numpy array
				New uncorrelated vectors associated with each observation.
	eigenvals_all:	numpy array
				Eigenvalues associated with each factor.
	eigenvals:	numpy array
				Subset of eigenvalues_all reflecting only those that meet the
				criterion defined by parameters reduce and min_eig.
	weights:    numpy array
				Values applied to the input data (after z-scores) to get the PCA
				scores. "Component score coefficient matrix" in SPSS or
				"projection matrix" in the MDP library.
	comms: 		numpy array
				Communalities
	sum_sq_load: numpy array
				 Sum of squared loadings.
	comp_mat_rot: numpy array or None
				  Component matrix after rotation. Ordered from highest to lowest
				  variance explained based on sum_sq_load_rot. None if varimax=False.
	scores_rot:	numpy array or None
				Uncorrelated vectors associated with each observation, after
				rotation. None if varimax=False.
	weights_rot: numpy array or None
				Rotated values applied to the input data (after z-scores) to get
				the PCA	scores. None if varimax=False.
	sum_sq_load_rot: numpy array or None
				 Sum of squared loadings for rotated results. None if
				 varimax=False.

	'''

	def __init__(self, inputs, reduce=False, min_eig=1.0, varimax=False):
        # Step S1
		z_inputs = ZSCORE(inputs)  # seems necessary for SPSS "correlation matrix" setting (their default)
        
        # The rest is step S2
		# run base SPSS-style PCA to get all eigenvalues
		pca_node = MDP.nodes.WhiteningNode()  # settings for the PCA
		scores = pca_node.execute(z_inputs)  # base run PCA
		eigenvalues_all = pca_node.d   # rename PCA results

		# run SPSS-style PCA based on user settings
		pca_node = MDP.nodes.WhiteningNode(reduce=reduce, var_abs=min_eig)  # settings for the PCA
		scores = pca_node.execute(z_inputs)  # run PCA  (these have mean=0, std_dev=1)
		weights = pca_node.v  # rename PCA results (these might be a transformation of the eigenvectors)
		eigenvalues = pca_node.d   # rename PCA results
		component_matrix = weights * eigenvalues  # compute the loadings
		component_matrix = self._reflect(component_matrix)   # get signs to match SPSS
		communalities = (component_matrix**2).sum(1)   # compute the communalities
		sum_sq_loadings = (component_matrix**2).sum(0) # note that this is the same as eigenvalues
		weights_reflected = component_matrix/eigenvalues  # get signs to match SPSS
		scores_reflected = np.dot(z_inputs, weights_reflected)  # note that abs(scores)=abs(scores_reflected)

		if varimax:
			# SPSS-style varimax rotation prep
			c_normalizer = 1. / MDP.numx.sqrt(communalities)  # used to normalize inputs to varimax
			c_normalizer.shape = (component_matrix.shape[0],1)  # reshape to vectorize normalization
			cm_normalized = c_normalizer * component_matrix  # normalize component matrix for varimax

			# varimax rotation
			cm_normalized_varimax = self._varimax(cm_normalized)  # run varimax
			c_normalizer2 = MDP.numx.sqrt(communalities)  # used to denormalize varimax output
			c_normalizer2.shape = (component_matrix.shape[0],1)  # reshape to vectorize denormalization
			cm_varimax = c_normalizer2 * cm_normalized_varimax  # denormalize varimax output

			# reorder varimax component matrix
			sorter = (cm_varimax**2).sum(0)  # base the ordering on sum of squared loadings
			sorter = zip(sorter.tolist(), range(sorter.shape[0]))  # add index to denote current order
			sorter = sorted(sorter, key=itemgetter(0), reverse=True)  # sort from largest to smallest
			sum_sq_loadings_varimax, reorderer = zip(*sorter)  # unzip the sorted list
			sum_sq_loadings_varimax = np.array(sum_sq_loadings_varimax)  # convert to array
			cm_varimax = cm_varimax[:,reorderer]  # reorder component matrix

			# varimax scores
			cm_varimax_reflected = self._reflect(cm_varimax)  # get signs to match SPSS
			varimax_weights = np.dot(cm_varimax_reflected,
							  np.linalg.inv(np.dot(cm_varimax_reflected.T,
							  cm_varimax_reflected))) # CM(CM'CM)^-1
			scores_varimax = np.dot(z_inputs, varimax_weights)
		else:
			comp_mat_rot = None
			scores_rot = None
			weights_rot = None

		# assign output variables
		self.z_inputs = z_inputs
		self.scores = scores_reflected
		self.comp_mat = component_matrix
		self.eigenvals_all = eigenvalues_all
		self.eigenvals = eigenvalues
		self.weights = weights_reflected
		self.comms = communalities
		self.sum_sq_load = sum_sq_loadings
		self.comp_mat_rot = cm_varimax_reflected
		self.scores_rot = scores_varimax # PCA scores output
		self.weights_rot = varimax_weights # PCA weights output
		self.sum_sq_load_rot = sum_sq_loadings_varimax

	def _reflect(self, cm):
		# reflect factors with negative sums; SPSS default
		cm = copy.deepcopy(cm)
		reflector = cm.sum(0)
		for column, measure in enumerate(reflector):
			if measure < 0:
				cm[:,column] = -cm[:,column]
		return cm

	def _varimax(self, Phi, gamma = 1.0, q = 100, tol = 1e-6):
		# downloaded from http://en.wikipedia.org/wiki/Talk%3aVarimax_rotation
		# also here http://stackoverflow.com/questions/17628589/perform-varimax-rotation-in-python-using-numpy
		p,k = Phi.shape
		R = np.eye(k)
		d=0
		for i in range(q):
			d_old = d
			Lambda = np.dot(Phi, R)
			u,s,vh = np.linalg.svd(np.dot(Phi.T,np.asarray(Lambda)**3 - (gamma/p) *
							np.dot(Lambda, np.diag(np.diag(np.dot(Lambda.T,Lambda))))))
			R = np.dot(u,vh)
			d = np.sum(s)
			if d_old!=0 and d/d_old < 1 + tol:
				break
		return np.dot(Phi, R)


# Build FEMA subRegions Dict values= state ID's
FEMA_subs = dict()
FEMA_subs['FEMA_1'] = ['g23g33g25', 'g50', 'g09', 'g44']
FEMA_subs['FEMA_2'] = ['g36', 'g34']
FEMA_subs['FEMA_3'] = ['g42', 'g10', 'g11', 'g24', 'g51', 'g54']
FEMA_subs['FEMA_4'] = ['g21', 'g47', 'g37', 'g28', 'g01', 'g13', 'g45', 'g12']
FEMA_subs['FEMA_5'] = ['g27', 'g55', 'g26', 'g17', 'g18', 'g39']
FEMA_subs['FEMA_6'] = ['g35', 'g48', 'g40', 'g05', 'g22']
FEMA_subs['FEMA_7'] = ['g31', 'g19', 'g20', 'g29']
FEMA_subs['FEMA_8'] = ['g30', 'g38', 'g56', 'g46', 'g49', 'g08']
FEMA_subs['FEMA_9'] = ['g06', 'g32', 'g04']
FEMA_subs['FEMA_10'] = ['g53', 'g41', 'g16']

####################################
# DataFrames to hold US, FEMA region, and state level results
####################################

# Dict to hold variable loadings
# key will be [USA, Fema_region, stateid] depending on level of analysis
varContrib = {}

# National Score
US_Sovi_Score = pd.DataFrame(index=counties.GEOID,
                             columns=['sovi', 'rank'])

# In the FEMA_Region_Sovi_Score data frame ranks are BY FEMA REGION.
# The data frame holds both the SOVI score and the county rank
# This means that there should be 10 counties with rank 1 (one for each
# FEMA Region)
FEMA_Region_Sovi_Score = pd.DataFrame(index=counties.GEOID,
                                      columns=['sovi', 'rank', 'fema_region'])

# Create New England conglomerate of states
# These are the FIPS codes for the states with the letter "g" appended
counties.loc[counties.stateID.isin(['g23', 'g33', 'g25']), 'stateID'] = 'g23g33g25'

# These are the states in the state level analysis
stateList = ['g23g33g25', 'g36', 'g51', 'g13',
             'g17', 'g48', 'g29', 'g46', 'g06', 'g16']

# In the State_Sovi_Score data frame ranks are BY STATE.
# The data frame holds both the SOVI score and the county rank
# This means that there should be 10 counties with rank 1 (one for each
# state in stateList)
State_Sovi_Score = pd.DataFrame(
    index=counties.index[counties.stateID.isin(stateList)],
    columns=['sovi', 'rank', 'state_id'])


#######################
# Compute National SoVI --- Origninal Methods
#######################
# compute SoVI
# Step M2
inputData = counties.drop(['GEOID', 'stateID'], axis=1, inplace=False)

# Step M3
inputData_array = inputData.values  # Convert DataFrame to NumPy array

# Step M4
pca = SPSS_PCA(inputData_array, reduce=True, varimax=True)

# Step M5
sovi_actual_us = pca.scores_rot.sum(1)

# Step M6
sovi_actual_us = pd.DataFrame(
    sovi_actual_us, index=counties.GEOID, columns=['sovi'])

# Step M8
sovi_actual_us['rank'] = sovi_actual_us.rank(
    method='average', ascending=False)
US_Sovi_Score.update(sovi_actual_us)

# Step M9
attrib_contribution_us = pca.weights_rot.sum(1)
varContrib['USA'] = zip(attr_names1, attrib_contribution_us.tolist()) # Generate dictionary for all net loadings by variable for US

# Quick check of ranks: max should equal number of counties in US
try:
    US_Sovi_Score['rank'].max() == len(counties)
except:
    print("error in ranking check")
    raise

# cleanup
del inputData
# del inputData_norm
del sovi_actual_us
del attrib_contribution_us


# Make Table of Outputs - Output table of factor loadings for National SoVI (weights_rot)

pca.weights_rot
pd.DataFrame(pca.weights_rot, index=attr_names1)


#######################
# Compute National SoVI --- Reweighted Model 
#######################
# compute SoVI
# Step M2
#inputData = counties.drop(['GEOID', 'stateID'], axis=1, inplace=False)

# Step M3
#inputData_array = inputData.values  # Convert DataFrame to NumPy array

# Step M4
#pca = SPSS_PCA(inputData_array, reduce=True, varimax=True)

# variance explained for each component is the eigenvalue divided by the sum of all eigenvalues
# the eigenvalue is equivalent to the sum of squared loadings
#variance_explained = pca.sum_sq_load_rot / pca.eigenvals_all.sum()
#print("Percent variance explained by each component")
#print(variance_explained)

# ------------------------------------------------------
# Step M5
# get calculated Principal Component values from PCA
#sovi_actual_us = pca.scores_rot

# weight both loadings and component values by variance explained
#for i in range(0, len(variance_explained)):
    #sovi_actual_us[i] = sovi_actual_us[i] * (variance_explained[i])
   # loadings[i] = loadings[i] * variance_explained[i]

# Step M6
#sovi_actual_us = pd.DataFrame(
   # sovi_actual_us, index=counties.GEOID)

# sum reweighted components
#sovi_actual_us['sovi'] = sovi_actual_us.sum(axis=1)

# --------------------------------------------------------
# Step M5
# get calculated Principal Component values from PCA
#sovi_actual_us = pca.scores_rot

# weight both loadings and component values by variance explained
#for i in range(0, len(variance_explained)):
    #sovi_actual_us[i] = sovi_actual_us[i] * (variance_explained[i])
    #loadings[i] = loadings[i] * variance_explained[i]

# Step M6
#sovi_actual_us = pd.DataFrame(
 #   sovi_actual_us, index=counties.GEOID)

# sum reweighted components
#sovi_actual_us['sovi'] = sovi_actual_us.sum(axis=1)

#----------------------------------------------------------
# Step M8
#sovi_actual_us['rank'] = sovi_actual_us['sovi'].rank(
#    method='average', ascending=False)
#US_Sovi_Score.update(sovi_actual_us)

# Step M9
#attrib_contribution_us = loadings.sum(1)
#varContrib['USA'] = zip(attr_names1, attrib_contribution_us.tolist()) # Generate dictionary for all net loadings by variable for US

# Quick check of ranks: max should equal number of counties in US
#try:
  #  US_Sovi_Score['rank'].max() == len(counties)
#except:
   # print("error in ranking check")
   # raise

# cleanup
#del inputData
# del inputData_norm

# del sovi_actual_us
#del attrib_contribution_us

# ------------------------------------------------------
#national = pd.DataFrame(pca.scores_rot)
#national


national = pd.DataFrame(pca.scores_rot)
national


# Calculate the percent varience explained - each factor has one eigen value representing amount of factor rotation 
pct_val_explained = pca.sum_sq_load_rot/sum(pca.eigenvals_all)
pd.DataFrame(pct_val_explained)


###########################
# Compute FEMA Region SoVI
###########################
for i in FEMA_subs:

    # Step M1: Subset FEMA subregion
    FEMARegionData = counties[counties['stateID'].isin(FEMA_subs[i])]

    # Step M2
    inputData = FEMARegionData.drop(
        ['GEOID', 'stateID'], axis=1, inplace=False)
    
    # Step M3
    inputData_array = inputData.values  # Convert DataFrame to NumPy array
    
    # Step M4
    pca = SPSS_PCA(inputData_array, reduce=True, varimax=True)
    
    # Step M5
    sovi_actual_fema = pca.scores_rot.sum(1)

    # Step M6
    sovi_actual_fema = pd.DataFrame(
        sovi_actual_fema, index=FEMARegionData.index, columns=['sovi'])
    
    # Step M7
    sovi_actual_fema['fema_region'] = i
    
    # Step M8
    sovi_actual_fema['rank'] = sovi_actual_fema['sovi'].rank(
        method='average', ascending=False)
    FEMA_Region_Sovi_Score.update(sovi_actual_fema)
    
    # Step M9
    attrib_contribution_fema = pca.weights_rot.sum(1)
    varContrib[i] = zip(attr_names1, attrib_contribution_fema.tolist())

# cleanup
del FEMARegionData
del inputData
del sovi_actual_fema
del attrib_contribution_fema


############################
# Compute State Level SoVI
############################
for st in stateList:
    
    # Step M1: Subset FEMA subregion
    stateData = counties[counties.stateID == st]

    # Step M2
    inputData = stateData.drop(['GEOID', 'stateID'], axis=1, inplace=False)
    
    # Step M3
    inputData_array = inputData.values  # Convert DataFrame to NumPy array
    
    # Step M4
    pca = SPSS_PCA(inputData_array, reduce=True, varimax=True)
    
    # Step M5
    sovi_actual = pca.scores_rot.sum(1)
    
    # Step M6
    sovi_actual = pd.DataFrame(
        sovi_actual, index=stateData.index, columns=['sovi'])
    
    # Step M7
    sovi_actual['state_id'] = st
    
    # Step M8
    sovi_actual['rank'] = sovi_actual['sovi'].rank(
        method='average', ascending=False)
    State_Sovi_Score.update(sovi_actual)
    
    # Step M9
    attrib_contribution = pca.weights_rot.sum(1)
    varContrib[st] = zip(attr_names1, attrib_contribution.tolist())

# cleanup
del stateData
del inputData
del sovi_actual
del attrib_contribution


##########################################################################
# Ranks w/ Geographic Extent
# For each county rank within state for US, state, and fema_region sovis
##########################################################################

# Step IC1: Create an empty DataFrame with a column for each SoVI spatial extent and an index of each county FIPS code in the selected state(s) of the 10 FEMA regions
county_in_state_rank = pd.DataFrame(index=State_Sovi_Score.index,
                                    columns=['state_sovi_rank', 'fema_region_sovi_rank', 'us_sovi_rank'])

for st in stateList:
    if st == 'g23g33g25':
       # Step IC2: Select the index and SoVI scores from national model for Maine, New Hampshire, and Massachusetts
        st_cty_scores1 = US_Sovi_Score.loc[['g23' in s for s in US_Sovi_Score.index], 'sovi']
        st_cty_scores2 = US_Sovi_Score.loc[['g33' in s for s in US_Sovi_Score.index], 'sovi']
        st_cty_scores3 = US_Sovi_Score.loc[['g25' in s for s in US_Sovi_Score.index], 'sovi']
        st_cty_scores = pd.concat([st_cty_scores1, st_cty_scores2, st_cty_scores3])

        # Step IC3: Re-rank the national SoVI scores but just for the counties in the relevant states
        county_in_state_rank.loc[st_cty_scores.index, 'us_sovi_rank'] = st_cty_scores.rank(method='average', ascending=False)

        # Step IC4: Select the index and SoVI scores from FEMA model for Maine, New Hampshire, and Massachusetts
        st_cty_scores1 = FEMA_Region_Sovi_Score.loc[['g23' in s for s in FEMA_Region_Sovi_Score.index], 'sovi']
        st_cty_scores2 = FEMA_Region_Sovi_Score.loc[['g33' in s for s in FEMA_Region_Sovi_Score.index], 'sovi']
        st_cty_scores3 = FEMA_Region_Sovi_Score.loc[['g25' in s for s in FEMA_Region_Sovi_Score.index], 'sovi']
        st_cty_scores = pd.concat([st_cty_scores1, st_cty_scores2, st_cty_scores3])

        # Step IC5: Re-rank the FEMA SoVI scores but just for the counties in the relevant states
        county_in_state_rank.loc[st_cty_scores.index, 'fema_region_sovi_rank'] = st_cty_scores.rank(method='average', ascending=False)

        # Step IC6: Pull the state-only SoVI ranks into the same dataframe as the other data
        county_in_state_rank.loc[st_cty_scores.index, 'state_sovi_rank'] = State_Sovi_Score.loc[State_Sovi_Score['state_id'] == 'g23g33g25', 'rank']

    else:
        # Step IC2: select the index and SoVI scores from national model for the relevant state
        st_cty_scores = US_Sovi_Score.loc[[st in s for s in US_Sovi_Score.index], 'sovi']

        # Step IC3: Re-rank the national SoVI scores but just for the counties in the relevant state
        county_in_state_rank.loc[st_cty_scores.index, 'us_sovi_rank'] = st_cty_scores.rank(method='average', ascending=False)

        # Step IC4: Select the index and SoVI scores from FEMA model for the relevant state
        st_cty_scores = FEMA_Region_Sovi_Score.loc[[st in s for s in FEMA_Region_Sovi_Score.index], 'sovi']

        # Step IC5: Re-rank the FEMA SoVI scores but just for the counties in the relevant state
        county_in_state_rank.loc[st_cty_scores.index, 'fema_region_sovi_rank'] = st_cty_scores.rank(method='average', ascending=False)

        # Step IC6: Pull the state-only SoVI ranks into the same dataframe as the other data
        county_in_state_rank.loc[st_cty_scores.index, 'state_sovi_rank'] = State_Sovi_Score.loc[State_Sovi_Score['state_id'] == st, 'rank']


######################
# CORRELATIONS
######################

# Step IC 7: Create an empty DataFrame to hold Spearman test results
state_corrs = pd.DataFrame(index = stateList, columns = ['spearman_r_st_fema', 'pvalue_st_fema', 'spearman_r_st_us', 'pvalue_st_us'])

for st in stateList:
  if st == 'g23g33g25':
    # Step IC8: Calculate spearman correlation between state and FEMA, state and national
    multi_state_data_tmp1 = county_in_state_rank.loc[['g23' in s for s in county_in_state_rank.index], ]
    multi_state_data_tmp2 = county_in_state_rank.loc[['g25' in s for s in county_in_state_rank.index], ]
    multi_state_data_tmp3 = county_in_state_rank.loc[['g33' in s for s in county_in_state_rank.index], ]
    multi_state_data_tmp = pd.concat([multi_state_data_tmp1, multi_state_data_tmp2, multi_state_data_tmp3])
    st_fema_spearman = spearmanr(multi_state_data_tmp[['state_sovi_rank', 'fema_region_sovi_rank']])
    st_us_spearman = spearmanr(multi_state_data_tmp[['state_sovi_rank', 'us_sovi_rank']])
    state_corrs.loc['g23g33g25', ] = [st_fema_spearman[0], st_fema_spearman[1], st_us_spearman[0], st_us_spearman[1]]
  else:
    # Step IC8: Calculate spearman correlation between state and FEMA, state and national
    st_fema_spearman = spearmanr(county_in_state_rank.loc[[st in s for s in county_in_state_rank.index], ['state_sovi_rank', 'fema_region_sovi_rank']])
    st_us_spearman = spearmanr(county_in_state_rank.loc[[st in s for s in county_in_state_rank.index], ['state_sovi_rank', 'us_sovi_rank']])
    state_corrs.loc[st, ] = [st_fema_spearman[0], st_fema_spearman[1], st_us_spearman[0], st_us_spearman[1]]


# Step TC1: Create a DataFrame to hold variable contributions values
variable_contributions = pd.DataFrame(index=attr_names1)

# Step TC2: Add variable contributions values to DataFrame
for area in varContrib.keys():
    variable_contributions[area] = [x for i, x in varContrib[area]]
    
# Step TC3: For all SoVI models, rank variables from the greatest to the least magnitudes
rankContrib = abs(variable_contributions).apply(rankdata, axis=0, method='average')
rankContrib = (28-rankContrib) + 1

# Step TC4: Sort variable rankings according to national model's most to least important
rankContrib = rankContrib.sort_values("USA", ascending = True).reset_index()
rankContrib.index = rankContrib["index"]
rankContrib = rankContrib.drop(columns = ["index"])

# Step TC5: Calculate summary statistics for each variable
summary_stats = pd.DataFrame( {"Min": rankContrib.min(axis = 1).round(),
                               "Max": rankContrib.max(axis = 1).round(),
                               "Range": rankContrib.max(axis = 1) - rankContrib.min(axis = 1).round(),
                               "Average": rankContrib.mean(axis = 1).round(2)
                              } )

# Step TC6: determine signs of USA model
def pos_neg(x):
    if x > 0:
        return "+"
    else:
        return "-"
usa = variable_contributions["USA"].apply(pos_neg)

# Step TC7: Determine all positive/negatives
reversals_adj = variable_contributions < 0

# Step TC8: Separate data 
reference = reversals_adj[["USA"]]
other_vars = reversals_adj.drop(columns = ["USA"])

# Step TC9: Determine all reversals from expected sign
for i in range(len(other_vars.columns)):
    other_vars.iloc[:, i] = reversals_adj["USA"].eq(other_vars.iloc[:, i]).eq(False)
    
# Step TC10: calculate reversals
reversal_sum = pd.DataFrame( {"Reversals": other_vars.sum(axis = 1)} )

# Step TC11: Join data
summary_stats = summary_stats.merge(reversal_sum, left_index = True, right_index = True)

# Step TC12: Join data
summary_stats = summary_stats.merge(usa, left_index = True, right_index = True)

# Step TC13: Change index labels to reflect any changes prior to SoVI calculation
for name, sign, sample, hrname in input_names:
    if sign == 'neg':
        summary_stats = summary_stats.rename(index={name: '- '+ name})
    else:
        pass

# Table aesthetics edits
summary_stats = summary_stats.rename(columns={"USA": "National Model"})
summary_stats = summary_stats.loc[:,['National Model', 'Reversals', 'Min', 'Average', 'Max', 'Range']]


#####################################################
# OUTPUT TABLES
#####################################################
US_Sovi_Score.to_csv( here(path["ddpub"], 'US_Sovi_Score.csv') )


# In the FEMA_Region_Sovi_Score data frame ranks are BY FEMA REGION.
# The data frame holds both the SOVI score and the county rank
# This means that there should be 10 counties with rank 1 (one for each
# FEMA Region)
FEMA_Region_Sovi_Score.to_csv( here(path["ddpub"], 'FEMA_Region_Sovi_Score.csv') )

# In the State_Sovi_Score data frame ranks are BY STATE.
# The data frame holds both the SOVI score and the county rank
# This means that there should be 10 counties with rank 1 (one for each
# state in stateList)
State_Sovi_Score.to_csv( here(path["ddpub"], 'State_Sovi_Score.csv') )

# County rank within state for US, state, and fema_region sovis
county_in_state_rank.to_csv( here(path["ddpub"], 'County_in_State_Rank.csv') )

# Variable contributions for sovis at all geographic extents
variable_contributions.to_csv( here(path["ddpub"], 'variable_contributions.csv') )

# Correlation of ranks
state_corrs.to_csv( here(path["ddpub"], 'state_fema_us_rank_correlations.csv') )


def check_it(file, rounder = False):
    '''
    Given a file name, this function finds the corresponding file provided by Spielman et al. and the file produced
    by our code and returns the number of matches for each column.    
    '''
    global rpl
    global og
    global test
    
    rpl = pd.read_csv( here(path["ddpub"], file) )
    og = pd.read_csv( here(path["og_out"], file) )
    og = og.rename(columns = {"Geo_FIPS": "GEOID"})
    
    if "sovi" in rpl.columns:
        rpl["sovi"] = rpl["sovi"].round(2)
        og["sovi"] = og["sovi"].round(2)
    
    if "Unnamed: 0" in rpl.columns:
        rpl.index = rpl["Unnamed: 0"]
        rpl = rpl.drop(columns = ["Unnamed: 0"])
        
    if "Unnamed: 0" in og.columns:
        og.index = og["Unnamed: 0"]
        og = og.drop(columns = ["Unnamed: 0"])
        
    if rounder != False:
        og = og.round(rounder)
        rpl = rpl.round(rounder)
        
    test = rpl.eq(og)
    
    if test.sum().eq(len(rpl)).sum() == len(test.sum()):
        return print("All values match!")
    else:
        return test.sum()


check_it('US_Sovi_Score.csv')

GEOID    3143
sovi     3143
rank     3141
dtype: int64


merged = og.merge(rpl, how = "inner", on = "GEOID")
merged.loc[~test["rank"]]


check_it('FEMA_Region_Sovi_Score.csv')

GEOID          3143
sovi           3109
rank           3109
fema_region    3109
dtype: int64


merged = og.merge(rpl, how = "inner", on = "GEOID")
merged.loc[~test["rank"] | ~test["sovi"] | ~test["fema_region"]]#.head()


check_it('State_Sovi_Score.csv')

All values match!


check_it("County_in_State_Rank.csv")

All values match!


check_it("variable_contributions.csv", rounder = 3)

All values match!


check_it("state_fema_us_rank_correlations.csv", rounder = 14)

All values match!


# Read files
counties = gpd.read_file( here(path["ddpub"], "counties.gpkg") )
USA = pd.read_csv( here(path["ddpub"], "US_Sovi_Score.csv") ).rename( columns={"sovi": "sovi_USA"} )
FEMA = pd.read_csv( here(path["ddpub"], "FEMA_Region_Sovi_Score.csv") ).rename( columns={"sovi": "sovi_FEMA"} )
CA = pd.read_csv( here(path["ddpub"], "State_Sovi_Score.csv") ).rename( columns={"sovi": "sovi_CA"} )


# Edit counties GEOID to match other datasets
counties["GEOID"] = "g" + counties["GEOID"]

# Select just the rows and columns needed
counties_CA = counties.loc[counties["STATE"] == "06"]
counties_CA = counties_CA[["GEOID", "geometry"]]

# Join all datasets
counties_CA = counties_CA.merge(USA, on = "GEOID")
counties_CA = counties_CA.merge(FEMA, on = "GEOID")
counties_CA = counties_CA.merge(CA, on = "GEOID")

counties_CA['rank_USA'] = counties_CA['sovi_USA'].rank(method='average', ascending=False)
counties_CA['rank_FEMA'] = counties_CA['sovi_FEMA'].rank(method='average', ascending=False)
counties_CA['rank_CA'] = counties_CA['sovi_CA'].rank(method='average', ascending=False)

mycolor = ListedColormap('#DBDBDB')


# Create overarching plot
fig, ax = plt.subplots(1, 4, figsize=(20, 8))

fig.tight_layout()

ax[0].axis('off')
ax[1].axis('off')
ax[2].axis('off')
ax[3].axis('off');

# Create CA rank map
top5_CA = counties_CA.loc[counties_CA["rank_CA"] < 6]
bottom5_CA = counties_CA.loc[counties_CA["rank_CA"] > 53]
ax[0].set_title("(a) California Analysis")
counties_CA.plot(ax = ax[0], cmap = mycolor, edgecolor = 'black', linewidth = .1)
top5_CA.plot(ax = ax[0], column = "rank_CA", cmap = "Reds_r")
top5_CA.apply(lambda x: ax[0].text(s=round(x['rank_CA']), color = 'black', x=x.geometry.centroid.coords[0][0], y=x.geometry.centroid.coords[0][1], ha='center', path_effects=[pe.withStroke(linewidth=1.5, foreground="white")]), axis=1, );
bottom5_CA.plot(ax = ax[0], column = "rank_CA", cmap = "Blues");
bottom5_CA.apply(lambda x: ax[0].text(s=round(x['rank_CA']), color = 'black', x=x.geometry.centroid.coords[0][0], y=x.geometry.centroid.coords[0][1], ha='center',  path_effects=[pe.withStroke(linewidth=1.5, foreground="white")]), axis=1, );
# Create FEMA rank map
top5_CA = counties_CA.loc[counties_CA["rank_FEMA"] < 6]
bottom5_CA = counties_CA.loc[counties_CA["rank_FEMA"] > 53]
ax[1].set_title("(b) FEMA Region IX Analysis")
counties_CA.plot(ax = ax[1], cmap = mycolor, edgecolor = 'black', linewidth = .1)
top5_CA.plot(ax = ax[1], column = "rank_FEMA", cmap = "Reds_r")
top5_CA.apply(lambda x: ax[1].text(s=round(x['rank_FEMA']), color = 'black', x=x.geometry.centroid.coords[0][0], y=x.geometry.centroid.coords[0][1], ha='center', path_effects=[pe.withStroke(linewidth=1.5, foreground="white")]), axis=1, );
bottom5_CA.plot(ax = ax[1], column = "rank_FEMA", cmap = "Blues");
bottom5_CA.apply(lambda x: ax[1].text(s=round(x['rank_FEMA']), color = 'black', x=x.geometry.centroid.coords[0][0], y=x.geometry.centroid.coords[0][1], ha='center',  path_effects=[pe.withStroke(linewidth=1.5, foreground="white")]), axis=1, );
# Create USA rank map
top5_CA = counties_CA.loc[counties_CA["rank_USA"] < 6]
bottom5_CA = counties_CA.loc[counties_CA["rank_USA"] > 53]
ax[2].set_title("(c) United States Analysis")
counties_CA.plot(ax = ax[2], cmap = mycolor, edgecolor = 'black', linewidth = .1)
top5_CA.plot(ax = ax[2], column = "rank_USA", cmap = "Reds_r")
top5_CA.apply(lambda x: ax[2].text(s=round(x['rank_USA']), color = 'black', x=x.geometry.centroid.coords[0][0], y=x.geometry.centroid.coords[0][1], ha='center', path_effects=[pe.withStroke(linewidth=1.5, foreground="white")]), axis=1, );
bottom5_CA.plot(ax = ax[2], column = "rank_USA", cmap = "Blues");
bottom5_CA.apply(lambda x: ax[2].text(s=round(x['rank_USA']), color = 'black', x=x.geometry.centroid.coords[0][0], y=x.geometry.centroid.coords[0][1], ha='center',  path_effects=[pe.withStroke(linewidth=1.5, foreground="white")]), axis=1, );
# Create range rank map
ax[3].set_title("(d) Range of SoVI Rankings")
counties_CA["min_rank"] = counties_CA[["rank_USA", "rank_FEMA", "rank_CA"]].min(axis = 1)
counties_CA["max_rank"] = counties_CA[["rank_USA", "rank_FEMA", "rank_CA"]].max(axis = 1)
counties_CA["range_rank"] = counties_CA["max_rank"] - counties_CA["min_rank"]
counties_CA.plot(ax = ax[3], column = "range_rank", cmap = "Reds", edgecolor = 'black', linewidth = .1, scheme="User_Defined", 
         legend=True, classification_kwds=dict(bins=[5,15,25,35,45]));
plt.savefig( here(path["rfig"], 'fig1.png') ) # Save image


from IPython.display import Image
weighted_fig1 = Image( here(path["rfig"], 'fig1_weighted.png') ) # Save image
weighted_fig1


import folium

USA_counties = USA.merge(counties, on='GEOID', how='left')

m = folium.Map(location=[48, -102], zoom_start=4)


folium.Choropleth(
    geo_data=counties,
    name="USA",
    data=USA_counties,
    columns=["GEOID", "sovi_USA"],
    key_on='feature.properties.GEOID',
    fill_color="RdBu_r",
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name="SoVI Score",
).add_to(m)

folium.LayerControl().add_to(m)

m


# Read in data
table2 = pd.read_csv( here(path["ddpub"], "state_fema_us_rank_correlations.csv") )

# Formatting table
table2.index = table2["Unnamed: 0"]
table2 = table2.drop(columns = ["Unnamed: 0"])
table2["FEMA Region"] = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]
table2["All US counties input file versus all counties in a state input file"] = table2["spearman_r_st_us"].round(2)
table2["All counties in a FEMA region versus counties in a state within the FEMA region input file"] = table2["spearman_r_st_fema"].round(2)
table2["State used for comparison"] = ["Composite of ME, NH, MA", "NY", "VA", "GA", "IL", "TX", "MO", "SD", "CA", "ID"]
table2 = table2.transpose().rename_axis('FEMA Region', axis='columns')
table2 = table2.rename(columns=table2.iloc[4])
table2 = table2.drop(labels = ["FEMA Region", "spearman_r_st_fema", "spearman_r_st_us"], axis = 0)


if (table2.loc["pvalue_st_fema"] < 0.01).sum() == 10 and (table2.loc["pvalue_st_us"] < 0.01).sum() == 10:
    table2 = table2.drop(["pvalue_st_fema", "pvalue_st_us"])
    print("p < 0.01 for all values")
else:
    print("Different result than in paper")

p < 0.01 for all values


table2


# Save file
table2.to_csv( here(path["rtab"],"table2.csv") )


table2b = pd.read_csv( here(path["rtab"],"table2_weighted.csv") )
# Formatting table
table2b.index = table2b["Unnamed: 0"]
table2b = table2b.drop(columns = ["Unnamed: 0"])

table2b


summary_stats


# Save file
summary_stats.to_csv( here(path["rtab"],"fig2.csv") )


fig2b = pd.read_csv(here(path["rtab"],"fig2_weighted.csv"), index_col = "Unnamed: 0")
fig2b


df_diff = pd.DataFrame(index = summary_stats.index)

df_diff['Diff_Reversals'] = summary_stats['Reversals'] - fig2b['Reversals']
df_diff['Diff_Min'] = summary_stats['Min'] - fig2b['Min']
df_diff['Diff_Average'] = summary_stats['Average'] - fig2b['Average']
df_diff['Diff_Max'] = summary_stats['Max'] - fig2b['Max']
df_diff['Diff_Range'] = summary_stats['Range'] - fig2b['Range']

df_diff

	Reproduction Label	Spielman Label	Alias	Definition	Type	Domain	Missing Data Value(s)	Missing Data Frequency
0	GEOID	Geo_FIPS	FIPS code unique identifier	Unique code for every county and county-equiva...	string	01001 - 56045	None	0.0
1	B01002_001E	ACS12_5yr_B01002001	median age	MEDIAN AGE BY SEX: Estimate!!Median age!!Total	float64	21.7 - 63	NaN	0.0
2	B03002_001E	ACS12_5yr_B03002001	total population of respondents to race/ethnicity	HISPANIC OR LATINO ORIGIN BY RACE: Estimate!!T...	int64	66 - 9840024	NaN	0.0
3	B03002_004E	ACS12_5yr_B03002004	total Black population	HISPANIC OR LATINO ORIGIN BY RACE: Estimate!!T...	int64	0 - 1267825	NaN	0.0
4	B03002_005E	ACS12_5yr_B03002005	total Native American population	HISPANIC OR LATINO ORIGIN BY RACE: Estimate!!T...	int64	0 - 59060	NaN	0.0
5	B03002_006E	ACS12_5yr_B03002006	total Asian population	HISPANIC OR LATINO ORIGIN BY RACE: Estimate!!T...	int64	0 - 1343920	NaN	0.0
6	B03002_012E	ACS12_5yr_B03002012	total Latinx population	HISPANIC OR LATINO ORIGIN BY RACE: Estimate!!T...	int64	0 - 4694846	NaN	0.0
7	B06001_002E	ACS12_5yr_B06001002	total population under 5 years of age	PLACE OF BIRTH BY AGE IN THE UNITED STATES: Es...	float64	0 - 651662	NaN	78.0
8	B09020_001E	ACS12_5yr_B09020001	total population over 65 years of age	RELATIONSHIP BY HOUSEHOLD TYPE (INCLUDING LIVI...	int64	5 - 1078555	NaN	0.0
9	B01003_001E	ACS12_5yr_B01003001	total population	TOTAL POPULATION: Estimate!!Total	int64	66 - 9840024	NaN	0.0
10	B25008_001E	ACS12_5yr_B25008001	total population in occupied housing units	TOTAL POPULATION IN OCCUPIED HOUSING UNITS BY ...	int64	62 - 9664175	NaN	0.0
11	B25002_002E	ACS12_5yr_B25002002	total occupied housing units	OCCUPANCY STATUS: Estimate!!Total!!Occupied	int64	35 - 3218511	NaN	0.0
12	B25003_003E	ACS12_5yr_B25003003	total renter occupied housing units	TENURE: Estimate!!Total!!Renter occupied	int64	14 - 1695180	NaN	0.0
13	B25002_001E	ACS12_5yr_B25002001	total housing units for which occupancy status...	OCCUPANCY STATUS: Estimate!!Total	int64	70 - 3441416	NaN	0.0
14	B09020_021E	ACS12_5yr_B09020021	total 65+ living in group quarters	RELATIONSHIP BY HOUSEHOLD TYPE (INCLUDING LIVI...	int64	0 - 37611	NaN	0.0
15	B01001_026E	ACS12_5yr_B01001026	total female population	SEX BY AGE: Estimate!!Total!!Female	int64	20 - 4987765	NaN	0.0
16	B11001_006E	ACS12_5yr_B11001006	total female-headed family households	HOUSEHOLD TYPE (INCLUDING LIVING ALONE): Estim...	int64	0 - 498851	NaN	0.0
17	B11001_001E	ACS12_5yr_B11001001	total households for which household type is k...	HOUSEHOLD TYPE (INCLUDING LIVING ALONE): Estim...	int64	35 - 3218511	NaN	0.0
18	B25002_003E	ACS12_5yr_B25002003	total vacant housing units	OCCUPANCY STATUS: Estimate!!Total!!Vacant	int64	35 - 245069	NaN	0.0
19	B19025_001E	ACS12_5yr_B19025001	aggregate household income	AGGREGATE HOUSEHOLD INCOME IN THE PAST 12 MONT...	int64	1785600 - 263044380000	NaN	0.0
20	B23022_025E	ACS12_5yr_B23022025	total males unemployed for last 12 months	SEX BY WORK STATUS IN THE PAST 12 MONTHS BY US...	int64	1 - 726803	NaN	0.0
21	B23022_049E	ACS12_5yr_B23022049	total females unemployed for last 12 months	SEX BY WORK STATUS IN THE PAST 12 MONTHS BY US...	int64	0 - 1131737	NaN	0.0
22	B23022_001E	ACS12_5yr_B23022001	total population for which unemployment and se...	SEX BY WORK STATUS IN THE PAST 12 MONTHS BY US...	int64	45 - 6658456	NaN	0.0
23	B17021_002E	ACS12_5yr_B17021002	total population below poverty level	POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 M...	int64	0 - 1658231	NaN	0.0
24	B17021_001E	ACS12_5yr_B17021001	total population for which poverty information...	POVERTY STATUS OF INDIVIDUALS IN THE PAST 12 M...	int64	64 - 9684503	NaN	0.0
25	B25024_010E	ACS12_5yr_B25024010	number of mobile home housing units in structure	UNITS IN STRUCTURE: Estimate!!Total!!Mobile home	int64	0 - 85377	NaN	0.0
26	B25024_001E	ACS12_5yr_B25024001	total housing units in structure	UNITS IN STRUCTURE: Estimate!!Total	int64	70 - 3441416	NaN	0.0
27	C24010_038E	ACS12_5yr_C24010038	total female employed	SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED PO...	int64	12 - 2056023	NaN	0.0
28	C24010_001E	ACS12_5yr_C24010001	total population for which sex and occupation ...	SEX BY OCCUPATION FOR THE CIVILIAN EMPLOYED PO...	int64	54 - 4495118	NaN	0.0
29	B19055_002E	ACS12_5yr_B19055002	total households with social security income	SOCIAL SECURITY INCOME IN THE PAST 12 MONTHS F...	int64	9 - 726298	NaN	0.0
30	B19055_001E	ACS12_5yr_B19055001	total households for which social security inc...	SOCIAL SECURITY INCOME IN THE PAST 12 MONTHS F...	int64	35 - 3218511	NaN	0.0
31	B09002_002E	ACS12_5yr_B09002002	total children in married couple families	OWN CHILDREN UNDER 18 YEARS BY FAMILY TYPE AND...	int64	0 - 1380977	NaN	0.0
32	B09002_001E	ACS12_5yr_B09002001	total children for which family type and age a...	OWN CHILDREN UNDER 18 YEARS BY FAMILY TYPE AND...	int64	0 - 2032147	NaN	0.0
33	B19001_017E	ACS12_5yr_B19001017	total households with over 200k income	HOUSEHOLD INCOME IN THE PAST 12 MONTHS (IN 201...	int64	0 - 208954	NaN	0.0
34	B06007_005E	ACS12_5yr_B06007005	total Spanish-speakers who speak english less ...	PLACE OF BIRTH BY LANGUAGE SPOKEN AT HOME AND ...	float64	0 - 1695391	NaN	78.0
35	B06007_008E	ACS12_5yr_B06007008	total people who speak another language and sp...	PLACE OF BIRTH BY LANGUAGE SPOKEN AT HOME AND ...	float64	0 - 743418	NaN	78.0
36	B06007_001E	ACS12_5yr_B06007001	total population with known language spoken at...	PLACE OF BIRTH BY LANGUAGE SPOKEN AT HOME AND ...	float64	66 - 9188362	NaN	78.0
37	B16010_002E	ACS12_5yr_B16010002	total population with less than a high school ...	EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...	int64	5 - 1508273	NaN	0.0
38	B16010_001E	ACS12_5yr_B16010001	total for which education, employment, languag...	EDUCATIONAL ATTAINMENT AND EMPLOYMENT STATUS B...	int64	65 - 6380366	NaN	0.0
39	C24050_002E	ACS12_5yr_C24050002	total population in extractive industries	INDUSTRY BY OCCUPATION FOR THE CIVILIAN EMPLOY...	int64	0 - 54942	NaN	0.0
40	C24050_001E	ACS12_5yr_C24050001	total population for which industry known	INDUSTRY BY OCCUPATION FOR THE CIVILIAN EMPLOY...	int64	54 - 4495118	NaN	0.0
41	C24050_029E	ACS12_5yr_C24050029	total people in service occupations	INDUSTRY BY OCCUPATION FOR THE CIVILIAN EMPLOY...	int64	4 - 837607	NaN	0.0
42	B08201_002E	ACS12_5yr_B08201002	total households with no available vehicle	HOUSEHOLD SIZE BY VEHICLES AVAILABLE: Estimate...	int64	0 - 577967	NaN	0.0
43	B08201_001E	ACS12_5yr_B08201001	total households for which vehicle status and ...	HOUSEHOLD SIZE BY VEHICLES AVAILABLE: Estimate...	int64	35 - 3218511	NaN	0.0
44	B25064_001E	ACS12_5yr_B25064001	median gross rent	MEDIAN GROSS RENT (DOLLARS): Estimate!!Median ...	int64	275 - 1678	NaN	0.0
45	B25077_001E	ACS12_5yr_B25077001	median home value	MEDIAN VALUE (DOLLARS): Estimate!!Median value...	float64	19400 - 944100	NaN	1.0

Label	Alias	Definition	Type	Accuracy	Domain	Missing Data Value(s)	Missing Data Frequency
SE_T02A_002	Land area	Area (Land) in square miles	float64	...	1.998779 - 145504.8	nan	0
Geo_FIPS	FIPS code unique identifier	Unique code for every county and county-equivalent in USA	string	...	g01001 - g56045	None	0

Label	Alias	Definition	Type	Domain	Missing Data Value(s)
STATE	State-level FIPS code	State-level FIPS code	string	01 - 56	None
COUNTY	County-level FIPS code	County-level FIPS code	string	001 - 840	None
CENSUSAREA	land area	land area in square miles	float64	1.999 - 145504.789	NaN

	GEOID	CENSUSAREA	Geo_FIPS	SE_T02A_002
0	01001	594.436	01001	594.4361
1	01005	884.876	01005	884.8763
2	01007	622.582	01007	622.5823
3	01009	644.776	01009	644.7759
4	01011	622.805	01011	622.8047
...	...	...	...	...
2285	55137	626.153	55137	626.1533
2286	55141	793.116	55141	793.1165
2287	56013	9183.814	56013	9183.8130
2288	56023	4076.129	56023	4076.1300
2289	56037	10426.649	56037	10426.6500

	0	1	2	3	4	5	6	7
MEDAGE_ACS	0.028558	-0.097043	0.263585	-0.021732	0.064951	-0.047413	0.041284	-0.020852
BLACK_ACS	0.231124	-0.081235	-0.032835	-0.128022	0.035158	-0.172086	-0.031832	-0.086503
QNATAM_ACS	-0.120374	0.051713	0.034822	-0.061283	-0.125527	0.649159	0.025288	0.040600
QASIAN_ACS	0.059697	-0.177355	0.004209	0.044335	0.096726	-0.020888	-0.089701	-0.006796
QHISP_ACS	-0.059316	0.021661	0.036007	0.430707	0.004060	-0.100344	0.094955	0.134917
QAGEDEP_ACS	-0.032367	0.008297	0.231777	0.132433	0.137210	0.043627	0.235682	0.003748
QPUNIT_ACS	-0.025747	0.041537	-0.025242	0.124587	-0.292591	0.168631	0.109509	-0.084228
PRENTER_ACS	0.040047	0.009589	-0.232451	0.042722	0.268396	-0.015020	-0.015465	0.069733
QNRRES_ACS	-0.028948	0.071364	-0.074561	0.021579	0.395471	-0.046200	-0.074874	0.067630
QFEMALE_ACS	-0.029268	0.039311	0.079810	0.115028	-0.091273	0.036534	0.704244	-0.117437
QFHH_ACS	0.144722	-0.003821	-0.043494	0.005639	-0.033313	0.016059	0.107111	0.014858
QUNOCCHU_ACS	0.007502	-0.099646	0.285825	0.023212	-0.113704	0.177186	-0.127205	0.126578
PERCAP_ALT	0.033256	0.178163	-0.038022	0.052142	-0.033767	0.034974	0.028228	0.024147
QESL_ALT	-0.018957	-0.042262	0.030951	0.386709	0.049780	-0.036214	0.067625	0.043922
QCVLUN	0.197523	-0.049893	0.087432	-0.044395	-0.101308	-0.093135	-0.127564	-0.018783
QPOVTY	0.123177	0.050919	-0.003049	0.039227	0.032417	0.060821	0.015164	0.024725
QMOHO	0.158646	-0.007006	0.098538	-0.064254	-0.184658	-0.079373	-0.104182	-0.184974
QED12LES_ALT	0.156469	0.022291	0.038671	0.155598	-0.000912	-0.104740	0.011764	-0.139073
QFEMLBR	-0.004034	0.001923	0.010762	-0.069927	-0.025838	0.005249	0.257073	0.330415
QEXTRCT_ALT	-0.027330	0.053829	0.036552	0.118236	0.148751	0.117389	-0.091575	-0.241253
QSERV_ALT	-0.049305	0.014880	0.040729	0.115068	0.068498	0.006772	-0.155902	0.606872
QSSBEN	0.077827	-0.019041	0.244719	0.030591	0.020558	-0.038686	0.090472	-0.004187
QNOAUTO_ALT	0.143955	-0.112268	0.029233	-0.053675	0.207615	0.352145	-0.013351	-0.154331
QFAM	0.147075	-0.021550	-0.012901	-0.026884	0.081235	-0.000337	0.010718	0.116535
QRICH200K	-0.031148	0.213393	-0.076973	-0.012445	0.041138	0.006493	-0.028958	0.103793
MDGRENT_ALT	-0.012350	0.200516	-0.058312	-0.016243	0.143973	0.045744	-0.001854	-0.075001
MHSEVAL_ALT	-0.019246	0.235548	-0.107857	0.000564	0.081574	-0.005632	0.016087	-0.056065
POPDENS	0.162733	-0.206821	0.063250	0.000379	0.266078	0.132037	0.005506	-0.268089

	0	1	2	3	4	5	6	7
0	0.165896	-0.137757	-0.618567	-0.717923	-1.456492	-0.405764	0.410837	-0.618133
1	-0.297981	-0.686906	0.796419	-0.271290	-1.107835	-0.194036	0.373707	0.206866
2	2.433497	0.271635	-0.362751	-0.746723	-0.024033	-0.564124	-1.006033	-0.025726
3	1.334588	0.523674	-0.110207	-0.743303	-1.689044	-0.520783	-1.446104	-1.766887
4	0.087161	0.549725	0.039776	0.159100	-1.157814	-0.507636	0.191213	-1.280130
...	...	...	...	...	...	...	...	...
3138	-0.851567	-0.299621	-1.266409	0.118841	-0.843377	-0.213839	-1.582643	-1.401207
3139	-1.635098	-3.099562	0.514943	0.475535	-2.037896	0.406036	-1.526467	1.714776
3140	-0.800701	0.234603	-0.946891	0.062005	-1.206986	0.121283	-0.800773	-0.625157
3141	-1.078669	0.255776	-0.265670	0.158393	0.625822	-0.130143	0.204333	-0.122033
3142	-0.592648	0.270919	0.020170	-0.486524	0.526319	-0.129719	-1.766916	-1.475899

	0
0	0.177494
1	0.162138
2	0.128328
3	0.090344
4	0.064942
5	0.055485
6	0.055178
7	0.052353

	GEOID	sovi_x	rank_x	sovi_y	rank_y
1150	g22075	-4.53	2984.0	-4.53	2983.0
3120	g56001	-4.53	2983.0	-4.53	2984.0

	GEOID	sovi_x	rank_x	fema_region_x	sovi_y	rank_y	fema_region_y
67	g02013	NaN	NaN	NaN	NaN	NaN	NaN
68	g02016	NaN	NaN	NaN	NaN	NaN	NaN
69	g02020	NaN	NaN	NaN	NaN	NaN	NaN
70	g02050	NaN	NaN	NaN	NaN	NaN	NaN
71	g02060	NaN	NaN	NaN	NaN	NaN	NaN
72	g02068	NaN	NaN	NaN	NaN	NaN	NaN
73	g02070	NaN	NaN	NaN	NaN	NaN	NaN
74	g02090	NaN	NaN	NaN	NaN	NaN	NaN
75	g02100	NaN	NaN	NaN	NaN	NaN	NaN
76	g02105	NaN	NaN	NaN	NaN	NaN	NaN
77	g02110	NaN	NaN	NaN	NaN	NaN	NaN
78	g02122	NaN	NaN	NaN	NaN	NaN	NaN
79	g02130	NaN	NaN	NaN	NaN	NaN	NaN
80	g02150	NaN	NaN	NaN	NaN	NaN	NaN
81	g02164	NaN	NaN	NaN	NaN	NaN	NaN
82	g02170	NaN	NaN	NaN	NaN	NaN	NaN
83	g02180	NaN	NaN	NaN	NaN	NaN	NaN
84	g02185	NaN	NaN	NaN	NaN	NaN	NaN
85	g02188	NaN	NaN	NaN	NaN	NaN	NaN
86	g02195	NaN	NaN	NaN	NaN	NaN	NaN
87	g02198	NaN	NaN	NaN	NaN	NaN	NaN
88	g02220	NaN	NaN	NaN	NaN	NaN	NaN
89	g02230	NaN	NaN	NaN	NaN	NaN	NaN
90	g02240	NaN	NaN	NaN	NaN	NaN	NaN
91	g02261	NaN	NaN	NaN	NaN	NaN	NaN
92	g02270	NaN	NaN	NaN	NaN	NaN	NaN
93	g02275	NaN	NaN	NaN	NaN	NaN	NaN
94	g02282	NaN	NaN	NaN	NaN	NaN	NaN
95	g02290	NaN	NaN	NaN	NaN	NaN	NaN
546	g15001	NaN	NaN	NaN	NaN	NaN	NaN
547	g15003	NaN	NaN	NaN	NaN	NaN	NaN
548	g15005	NaN	NaN	NaN	NaN	NaN	NaN
549	g15007	NaN	NaN	NaN	NaN	NaN	NaN
550	g15009	NaN	NaN	NaN	NaN	NaN	NaN

FEMA Region	I	II	III	IV	V	VI	VII	VIII	IX	X
All US counties input file versus all counties in a state input file	0.75	0.79	0.68	0.5	0.5	0.62	0.9	0.61	0.53	0.66
All counties in a FEMA region versus counties in a state within the FEMA region input file	0.94	0.61	0.9	0.8	0.34	0.68	0.82	0.87	0.69	0.88
State used for comparison	Composite of ME, NH, MA	NY	VA	GA	IL	TX	MO	SD	CA	ID

	I	II	III	IV	V	VI	VII	VIII	IX	X
Unnamed: 0
All US counties input file versus all counties in a state input file	0.73	0.74	0.67	0.5	0.51	0.62	0.9	0.59	0.48	0.65
All counties in a FEMA region versus counties in a state within the FEMA region input file	0.9	0.58	0.89	0.8	0.36	0.68	0.8	0.85	0.69	0.82
State used for comparison	Composite of ME, NH, MA	NY	VA	GA	IL	TX	MO	SD	CA	ID

	National Model	Reversals	Min	Average	Max	Range
QAGEDEP_ACS	+	0	1.0	9.52	24.0	23.0
QFEMALE_ACS	+	6	1.0	11.33	28.0	27.0
QSSBEN	+	0	1.0	10.33	21.0	20.0
QHISP_ACS	+	3	1.0	11.10	28.0	27.0
QPOVTY	+	0	1.0	9.81	27.0	26.0
QED12LES_ALT	+	2	2.0	10.38	28.0	26.0
QSERV_ALT	+	2	1.0	13.19	26.0	25.0
- PERCAP_ALT	+	2	6.0	15.57	23.0	17.0
QESL_ALT	+	4	1.0	11.43	27.0	26.0
- QFAM	+	0	1.0	11.33	28.0	27.0
QNOAUTO_ALT	+	0	1.0	12.43	28.0	27.0
- MDGRENT_ALT	+	0	8.0	16.76	27.0	19.0
QUNOCCHU_ACS	+	5	1.0	12.24	26.0	25.0
QFHH_ACS	+	4	2.0	15.14	28.0	26.0
- QRICH200K	+	4	3.0	18.81	28.0	25.0
QFEMLBR	+	5	2.0	14.05	28.0	26.0
MEDAGE_ACS	+	1	9.0	18.24	27.0	18.0
- MHSEVAL_ALT	+	2	1.0	18.14	28.0	27.0
QNRRES_ACS	+	3	2.0	13.33	27.0	25.0
QEXTRCT_ALT	+	2	1.0	17.43	28.0	27.0
QNATAM_ACS	+	8	1.0	14.81	25.0	24.0
QCVLUN	+	2	1.0	14.14	27.0	26.0
POPDENS	+	3	2.0	15.95	28.0	26.0
QASIAN_ACS	-	11	4.0	18.90	28.0	24.0
QPUNIT_ACS	+	16	2.0	17.05	28.0	26.0
PRENTER_ACS	+	6	3.0	20.19	27.0	24.0
QMOHO	+	8	2.0	17.24	28.0	26.0
BLACK_ACS	-	13	2.0	17.14	28.0	26.0

	Diff_Reversals	Diff_Min	Diff_Average	Diff_Max	Diff_Range
QAGEDEP_ACS	0	0.0	1.38	0.0	0.0
QFEMALE_ACS	3	0.0	-3.00	0.0	0.0
QSERV_ALT	1	0.0	-0.71	0.0	0.0
QHISP_ACS	0	1.0	-0.10	0.0	-1.0
QFEMLBR	-1	-1.0	-2.43	-3.0	-2.0
QNATAM_ACS	1	0.0	-5.19	3.0	3.0
QESL_ALT	0	0.0	-0.14	0.0	0.0
QSSBEN	0	2.0	1.43	0.0	-2.0
QNOAUTO_ALT	0	0.0	1.62	0.0	0.0
QMOHO	4	0.0	-1.95	-1.0	-1.0
QPOVTY	0	2.0	1.90	0.0	-2.0
QNRRES_ACS	1	-1.0	-3.14	-5.0	-4.0
- QFAM	2	0.0	2.34	0.0	0.0
QUNOCCHU_ACS	1	0.0	-1.53	-6.0	-6.0
- PERCAP_ALT	1	3.0	2.91	5.0	2.0
BLACK_ACS	-1	1.0	0.15	0.0	-1.0
- MDGRENT_ALT	1	0.0	1.53	0.0	0.0
- QRICH200K	2	0.0	0.38	-1.0	-1.0
MEDAGE_ACS	0	-1.0	-0.05	1.0	2.0
QFHH_ACS	2	0.0	2.24	0.0	0.0
PRENTER_ACS	1	0.0	-1.62	0.0	0.0
POPDENS	0	2.0	-0.09	-2.0	-4.0
QCVLUN	13	0.0	-0.76	-4.0	-4.0
- MHSEVAL_ALT	2	4.0	1.15	0.0	-4.0
QED12LES_ALT	0	2.0	4.62	0.0	-2.0
QEXTRCT_ALT	4	0.0	0.28	0.0	0.0
QASIAN_ACS	3	0.0	-0.19	0.0	0.0
QPUNIT_ACS	-2	0.0	-1.00	0.0	0.0

Reproduction of Spielman et al.’s 2020 Evaluation of the Social Vulnerability Index¶

Authors¶

Abstract¶

Keywords¶

Study design¶

Spatio-temporal metadata¶

Materials and procedure¶

Computational environment¶

Data and variables¶

(1) 2008-2012 American Community Survey (5-year)¶

(2) 2010 Decennial Census¶

(3) USA Counties Shapefile¶

(4) USA Counties Cartographic Boundaries¶

Data transformations¶

Analysis¶

Principal Component Analysis¶

Basic set-up for storing results¶

Calculating SoVI¶

Factor loadings for national model¶

Internal consistency analysis¶

Theoretical consistency analysis¶

Save analysis results¶

Results¶

Rpr-H1¶

US SoVI Scores & Rankings¶

FEMA Region SoVI Scores & Rankings¶

State SoVI Scores & Rankings¶

County in State Rank¶

Variable Contributions¶

State FEMA US Rank Correlations¶

RPr-H2¶

Figure 1 - Comparison of Spatial Inputs and SoVI Outputs for California¶

Figure 1b¶

Figure 2 - National SoVI Outputs¶

Table 2¶

Table 2b¶

RPr-H3¶

Figure 2¶

Figure 2b¶

Figure 2c¶

Discussion¶

Outdated packages¶

Extraneous data and code¶

Confusing file system¶

Incomplete code¶

Our own edits¶

Additional Reanalysis Edits¶

Conclusions¶

References¶

Funding¶