Debugging my thesis
Fri Dec 25 2020tags: public thesis programming debugging
Introduction
Since my undergraduate thesis was quite well-received, I thought it would be a good idea to publish it. To that end, Gabe and I have been working on my thesis. As a replication check both of us ran the code on our machines... both of our laptops produced the same data but they were different from the existing data generated by my old laptop.
In my thesis I generate 100,000 hypothetical districting plans, calculate four different compactness metrics (Polsby-Popper, Reock, Convex Hull, and my own Human Compactness) and see how they correlate with a measure of district homogeneity (Stephanopoulos's "spatial diversity" metric). On our replication runs we observed the following:
- Spatial diversity metric unchanged
- Polsby-Popper, Reock, and Convex Hull very slightly changed (around 1/10th of the values were different in the 11th or 12th decimal place)
- Big changes in human compactness metric (from 0.7 to 0.9, that sort of thing)
This was pretty bad, and threatened to derail my research completely
What wasn't the bug?
What was the bug?
Because I didn't freeze the environment, we had to install the newest versions of the libraries.
There is one line of code in sample_rvps
all_rvps = all_rvps.to_crs("EPSG:4326")
which converts the initial shapefile to a new coordinate reference system (CRS).
For some reason, the to_crs
function in the new setup
gives very slightly different values, differing in the 11th or 12th decimal place.
You can see it in the snippet below:
gdf = gdf.to_crs({"init": "epsg:4326"})
print(str(gdf["geometry"][0])) # on old machine: POINT (-73.59250979083568 41.01678126113871)
# This assert passes on my new machine but fails on my old one
assert str(gdf["geometry"][0]) == "POINT (-73.59250974534255 41.01678109722333)"
print("All asserts passed!")
This alone wouldn't cause that much of a discrepancy,
but it just so happens that this tiny change causes
a point which was originally out of the state line
to be in the state line.
This causes the points sampled by sample_rvps
to change completely even with the same random seed,
because all the points have been shifted down 1.
Because the points have all changed,
all the precalculated point-kNN values are all incorrect
(because they were for different points).
And this leads to completely incorrect results.
The other metrics all changed by a very very tiny amount.
initial_partition = GeographicPartition(
graph,
assignment="New_Seed",
updaters={
"cut_edges": cut_edges,
"population": Tally("population", alias="population"),
"spatial_diversity": spatial_diversity.calc_spatial_diversity,
"human_compactness": partial(
hc_utils.calculate_human_compactness,
DURATION_DICT,
tract_dict,
duration_matrix,
),
"polsby_compactness": polsby_popper,
"reock_compactness": partial(reock.reock, state_shp),
},)
We know the EPSG conversion is very slightly different in the new state. But there is no EPSG conversion in the Reock function!
My best guess (but I haven't verified this)
is that there is an upstream change in how points are calculated.
The Reock function calls the _generate_external_points
subroutine
to find all external points of the Census tracts
so that it can draw the circumscribing circle of the district:
def _generate_external_points(row, geoid_to_id_mapping):
'''
For each Census Tract, generate the exterior points of that Tract
'''
exterior_points = []
# print(row['geometry'].geom_type)
if row['geometry'].geom_type == 'MultiPolygon':
for polygon in row['geometry']:
exterior_points += list(polygon.exterior.coords)
else:
assert(row['geometry'].geom_type == 'Polygon')
polygon = row['geometry']
exterior_points += list(polygon.exterior.coords)
return exterior_points
I believe there might be a difference with polygon.exterior.coords
.
Perhaps there's something in the getting points from a shapefile thing that has changed.
If the exterior points' coordinates are very slightly different,
this will affect the Reock function.
What causes Polsby-Popper to change?
TL;DR: I have no idea.
I dug into the Gerrychain source code to look at the Polsby-Popper function:
import math
def compute_polsby_popper(area, perimeter):
try:
return 4 * math.pi * area / perimeter ** 2
except ZeroDivisionError:
return math.nan
def polsby_popper(partition):
"""
Computes Polsby-Popper compactness scores for each district in the partition.
"""
return {
part: compute_polsby_popper(
partition["area"][part], partition["perimeter"][part]
)
for part in partition.parts
}
It gets area
and perimeter
from the Partition object,
which takes in a graph, an assignment, and a list of updaters.
In 10_Process_Tracts.py
we have the following code:
state_graph = Graph.from_geodataframe(state_tracts, ignore_errors = True)
state_graph.add_data(state_tracts)
state_tracts.to_file(f"./Data/Shapefiles/Tract2000_{fips}.shp")
state_graph.to_json(f"./Data/Dual_Graphs/Tract2000_{fips}.json")
Looking at the GerryChain docs for from_geodataframe:
Creates the adjacency Graph of geometries described by dataframe. The areas of the polygons are included as node attributes (with key area). The shared perimeter of neighboring polygons are included as edge attributes (with key shared_perim). Nodes corresponding to polygons on the boundary of the union of all the geometries (e.g., the state, if your dataframe describes VTDs) have a boundary_node attribute (set to True) and a boundary_perim attribute with the length of this “exterior” boundary. By default, areas and lengths are computed in a UTM projection suitable for the geometries. This prevents the bizarro area and perimeter values that show up when you accidentally do computations in Longitude-Latitude coordinates. If the user specifies reproject=False, then the areas and lengths will be computed in the GeoDataFrame’s current coordinate reference system. This option is for users who have a preferred CRS they would like to use.
So the areas and perimeters of the polygons have already been pre-calculated and saved. As far as I can tell, the areas and perimeters of the polygons are not recalculated. And there doesn't seem to be a reprojection either. Why then can the Polsby-Popper metric still give a different result?
Questions I still have
Why does to_crs
give a very slightly different result?
I created an issue + minimum reproducible example and posted it on the
geopandas repo.
I am almost certain it is not the fault of geopandas but rather some upstream
dependency. This is why reock
and polsby-popper
are affected as well.
What's next
Things I learned
Bit-by-bit replication should be done with conda list -e > requirements.txt
.
Not using conda list -e
or pip freeze
or the like
is a mistake that I won't make again.