You should be geocoding crime data locally

If you google “how to geocode data” – the process of taking an address and figuring out the X/Y coordinates of that address – almost all of the search results bring up online geocoding engines, such as Google’s or ESRI’s world geocoding service.

For pretty much all crime data analysis it does not make sense to use these services. These online geocoding engines cost money, typically around $5 per 1,000 geocodes. They are also often not meant for geocoding data and saving the results to a database; they are meant for real-time web applications. With Google’s, for example, you are not supposed to cache the results at all.

Caching the data is a prerequisite for conducting any geospatial analysis of your crime data, such as hotspot or near-repeat crime analysis. That is, you geocode the events, save those X/Y coordinates to a table, and then later conduct whatever analysis you want.
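As a sketch of that caching workflow (the table name, column names, and sample rows here are all hypothetical, and I use sqlite as the local store), you might save and later retrieve the geocoded coordinates like so:

```python
import sqlite3

import pandas as pd

# hypothetical geocoded results, one row per crime incident
res = pd.DataFrame({'incident_id': [101, 102],
                    'X': [2030450.1, 2041200.7],
                    'Y': [800500.2, 811300.9]})

# cache the X/Y coordinates in a local sqlite table
con = sqlite3.connect(':memory:')  # use a file path for a persistent cache
res.to_sql('geocoded', con, if_exists='append', index=False)

# later, pull the cached coordinates back for analysis
cached = pd.read_sql('SELECT * FROM geocoded', con)
```

With `if_exists='append'`, newly geocoded batches accumulate in the same table over time, so you only ever geocode each event once.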

Most crime analysts need to batch geocode many crime events – I commonly need to geocode over 100,000 crime events. In this tutorial I am going to show how to use ESRI’s tools to create a local geocoding engine, using python code. Once you create a local engine, you can geocode as many crime events as you want; there is no limit.

Other options for geocoding locally are much lower quality than ESRI’s tool – ESRI has a much better engine for matching fuzzy address strings than any of the other public options. Also, online engines may incorrectly match addresses far outside your jurisdiction; if you build a local engine for just the area of interest, this will never happen.

So this tutorial requires that you have an ESRI license and ArcGIS Pro installed on your machine. (ESRI’s tools are also very fast compared to others.) Note that a cheaper ESRI license is sufficient for this; the more expensive enterprise server licensing is not necessary.

Setting up the environment

First, when you have ArcGIS Pro installed on your computer, it installs a local conda environment that has all the libraries you need for this example.

To follow along using the same data, you will need two different files: the public Durham crime data saved as a CSV, and the Durham street range shapefile.

For setup, I save several of these files in different locations. Many of the arcpy functions I am going to show work with files on disk, not in memory in python, so I save them separately to pedagogically show they are independent.

So say your project data is located at D:\Projects\Geocode. I then save the crime data as a CSV file at D:\Projects\Geocode\address\data.csv, and the Durham street range shapefile at D:\Projects\Geocode\shape_loc\Road.shp. I also create a folder, D:\Projects\Geocode\database, to save a file geodatabase.

I also created a new field in the crime data, called FULL_ADD, that replaces the trailing 00 of the house number with 50, and then appends , DURHAM to the address field. The public Durham crime data is fuzzed to the 100 block range, so this places the geocode in the middle of the street (this won’t be necessary if you are a crime analyst and have your local unfuzzed data).

The Durham street range data does not have zipcodes, but it does have the city, so including the city in the single line address improves the geocoding hit rate.
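As an example of that prep step in pandas (the column name ADDRESS and the sample rows here are made up; substitute whatever the raw address field is named in your file), you can build FULL_ADD before exporting the CSV:

```python
import pandas as pd

# made-up rows mimicking addresses fuzzed to the 100 block
dat = pd.DataFrame({'ADDRESS': ['500 S MANGUM ST', '1200 N DUKE ST']})

# swap the trailing 00 of the house number for 50 (mid-block),
# then append the city for the single line input
mid_block = dat['ADDRESS'].str.replace(r'^(\d+)00 ',
                                       lambda m: m.group(1) + '50 ',
                                       regex=True)
dat['FULL_ADD'] = mid_block + ', DURHAM'
# e.g. '500 S MANGUM ST' becomes '550 S MANGUM ST, DURHAM'
```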

Creating the locator

So I assume that you have opened the python command prompt that is nested in your ArcGIS Pro install (mine, for example, is located at C:\Program Files\ArcGIS\Pro\bin\Python\envs\arcgispro-py3), and then cd to your working directory, e.g. cd "D:\Projects\Geocode".

Now you can use the arcpy library to do essentially all the stuff you can do in the ArcGIS Pro GUI application (or formerly the desktop application). This code has several steps: set the workspace to where the shapefile is, build the field mapping string, and create the locator file.

Step 2 is the most painful, as it requires figuring out the field mapping and dealing with single quotes inside of a string. Here I show how to use a python dictionary to ease that process a bit.

# Creating a locator
import arcpy

# Step 1: Set to where your shapefile is
arcpy.env.workspace = r'./shape_loc'

# Step 2: Prepping the field mapping input
field_str = ""
add_type='StreetAddress'
tab = 'Road'
map_di = {'HOUSE_NUMBER_FROM_LEFT': 'ADD_LO_EVE',
          'HOUSE_NUMBER_TO_LEFT': 'ADD_HI_EVE',
          'HOUSE_NUMBER_FROM_RIGHT': 'ADD_LO_ODD',
          'HOUSE_NUMBER_TO_RIGHT': 'ADD_HI_ODD',
          'STREET_PREFIX_DIR': 'ROAD_PRE',
          'STREET_NAME': 'ROAD_NAME',
          'STREET_SUFFIX_TYPE': 'ROAD_TYPE',
          'CITY_LEFT': 'CITYNAMEL',
          'CITY_RIGHT': 'CITYNAMER',
          'STREET_SUFFIX_DIR': 'ROAD_SUF'}

for l, r in map_di.items():
    field_str += f"'{add_type}.{l} {tab}.{r}';"

print(field_str)

# Step 3: creating the locator file
arcpy.geocoding.CreateLocator("USA",
                              "Road StreetAddress",
                              field_str,
                              "./locator_loc/DurLoc.loc",
                              "ENG")

Now, one caveat: with arcpy.geocoding.CreateLocator, you pretty much just create a locator with many of the default settings. To update the settings – for address ranges you can set the side offset and end offset, for example – you need to reload the locator, set the object attribute, and then resave it.

# reload it and set the offset to 0 for middle of street
loc = arcpy.geocoding.Locator("./locator_loc/DurLoc.loc")
loc.sideOffset = 0
loc.updateLocator()

A final note is that this creates the locator in whatever projection system the original street address range shapefile was in. You likely want to have the data in a local projection, e.g. EPSG:2264 for Durham, whereas webmapping almost always uses EPSG:4326 (lat/lon).

Geocoding a file

Now that we have the locator saved, how do you use it? First, the arcpy module expects you to save the resulting file to a geodatabase, so if you need to create a new one, you can do that in code:

# Need to save the file in a geodatabase
# so this creates a new one
arcpy.management.CreateFileGDB("./database", "localdb.gdb")

Now we can geocode the address file. Again there is an idiosyncratic way to map the input fields to what the geocoding locator expects (you can use separate fields, but here I just have a single line input address).

This is superfast – geocoding millions of addresses may only take a few minutes. With web applications, by contrast, you often need to throttle your requests or batch the process into smaller subsets.

# have a csv file with the data
map_str = "'Single Line Input' FULL_ADD VISIBLE NONE;"

# geocoding the result
arcpy.geocoding.GeocodeAddresses('./address/data.csv',
                                 './locator_loc/DurLoc.loc',
                                 map_str,
                                 "./database/localdb.gdb/geo_res")

Now the geocoded table is saved in the file geodatabase. If you want to further analyze that table, you can read it back into a pandas dataframe in your python session. To do that, you need pandas plus a few accessors from the arcgis python library.

# to read back into a dataframe in python
# need pandas plus these arcgis accessors
import pandas as pd
from arcgis.features import GeoAccessor, GeoSeriesAccessor

sdf = pd.DataFrame.spatial.from_featureclass("./database/localdb.gdb/geo_res")

And then you can do whatever analysis of the records you want.
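For example, geocode results from ESRI locators include Status and Score fields, so a common first step is filtering to confident matches. A minimal sketch (the sample rows here are made up, standing in for the geocoded feature class):

```python
import pandas as pd

# made-up rows standing in for the geocoded output;
# Status is 'M' (matched), 'T' (tied), or 'U' (unmatched)
sdf = pd.DataFrame({'Status': ['M', 'M', 'U'],
                    'Score': [100.0, 92.5, 0.0],
                    'X': [2030450.1, 2041200.7, 0.0],
                    'Y': [800500.2, 811300.9, 0.0]})

# keep matched records with a reasonably high match score
matched = sdf[(sdf['Status'] == 'M') & (sdf['Score'] >= 85)]
```

Reviewing the unmatched residue ('U' records) is also worthwhile, since those often point to bad input addresses you can fix.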

Wrapping Up

I realize this tutorial involves quite a few things, but that is one of the reasons I showcase it on the blog. Part of the services Crime De-Coder provides is helping with process automation. I figure this stuff out, so you do not have to.

This approach effectively lets you geocode any location in the United States, since the census provides address range coverage (which in my experience is quite good) across the entire country.

Writing this code is only part of the process. An analyst likely wants to create a standard set of processes to geocode and cache data on a regular schedule. That also involves fixing bad addresses, which may be a separate process, either via arcpy alternate address tables or in python.

If you have millions of addresses that need to be geocoded, feel free to get in touch. You really don’t need to pay thousands of dollars to do that work using online tools.