Thursday, February 23, 2023

Geographic crime profiling: Analyzing Rossmo's Formula

 

There are many instances where applied mathematics is used to assist criminal investigators, from identifying suspects to analyzing evidence. For a long time I've been interested in the accuracy and reliability of these systems. After all, an incorrect result (a Type I or Type II error) could ruin an innocent person's life or allow a criminal to go free. There are many recorded instances where mathematics has been applied successfully, but there is also a growing body of evidence that math has been misused, or flat-out abused, in criminal proceedings. I've always had an interest in how technology is applied to the legal system, and specifically in how it can be tested for efficacy, reliability, accuracy, and precision.

In this post I will present an analysis of one famous geographic profiling model known as Rossmo's formula. The formula uses geographic profiling to identify locations with a high probability of being associated with a serial offender, called anchor points. The core hypothesis is that the locations where a serial offender encounters victims (and, in the case of murders, where the victims are recovered) are picked by the offender through an unconscious balancing of two pressures: the need to distance themselves from the crime for security, and the desire not to travel farther than necessary to commit their crimes. The formula was invented and patented by Kim Rossmo and later adapted into a suite of geographic profiling tools which investigators can use to plan their search efforts, starting with the highest-probability areas and working outward (https://patents.justia.com/patent/5781704).

The data inputs are a list of geographic coordinates in (latitude, longitude) format which represent locations significant to a series of crimes reasonably believed to have been committed by the same offender(s). The algorithm creates a grid over the geographic area bounded by the minimum and maximum latitude and longitude in the observed crimes, then assigns a score to each grid square based on its distance to each crime location. Let's take a look at Mr. Rossmo's formula and see if we can gain any insight.

$$P_{ij} = k \sum_{n=1}^{T} \left[ \frac{\Phi_{ij}}{\left(|X_i - x_n| + |Y_j - y_n|\right)^{f}} + \frac{(1 - \Phi_{ij})\, B^{\,g-f}}{\left(2B - |X_i - x_n| - |Y_j - y_n|\right)^{g}} \right]$$

where $\Phi_{ij} = 1$ if $|X_i - x_n| + |Y_j - y_n| > B$, and $\Phi_{ij} = 0$ otherwise.
I know, it looks intimidating at first but it's really not bad. Let's start with the variables. Pij is the probability of the grid square located at i,j; it's the value we will calculate. The terms i and j are also used in the formula to index the row and column of the grid square. The constant k is used to scale the results and prevent value overflows based on the integer and floating point precision of the platform running the analysis. Given modern improvements in precision we probably won't need to worry about this, unless you want to implement the code on a microcontroller. The value n is the iteration counter, which runs from 1 to T, the total number of crimes in the data set. The summation happens by comparing the grid square's location to each crime point (1 to T) using the Manhattan distance function. The formula |Xi - xn| + |Yj - yn| in the denominators of both fractions (as well as the where clause) is the Manhattan distance function. You find the Manhattan distance between two points by taking the absolute difference in the X coordinates and adding the absolute difference in the Y coordinates. Manhattan distance is a good choice when movement is constrained to a grid, like the streets of a big city such as Manhattan, hence the name. Xi and Yj are the x and y grid coordinates of the grid square being tested, while xn and yn are the x and y grid coordinates of the nth crime in the list. The term B is the radius of a buffer zone, calculated as half of the mean nearest neighbor distance between crime locations in the data set.
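To make that concrete, here's a minimal Python sketch of the Manhattan distance (the helper name is my own; we'll reuse it in later snippets):

```python
def manhattan(p, q):
    """Manhattan (taxicab) distance between two points,
    each an (x, y) pair: |x1 - x2| + |y1 - y2|."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

print(manhattan((2, 3), (5, 1)))  # |2 - 5| + |3 - 1| = 5
```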

The function Φij is equal to 1 if the Manhattan distance between the grid square at i,j and the nth crime location is greater than the buffer radius B we calculated; otherwise it is 0. This function acts as a selector for which of the two fractions impacts the result. In the case where the distance is greater than the buffer radius, the numerator of the fraction on the left side of the addition will be 1 and the numerator of the fraction on the right side will be (1-1) * (B^(g-f)) = 0 * (B^(g-f)) = 0, meaning only the first fraction will impact the result. Likewise, when the distance is equal to or less than the calculated buffer radius, the numerator on the left will be 0 and the numerator on the right will be (1-0) * (B^(g-f)) = B^(g-f), meaning only the second fraction will matter. Finally, the terms g and f are decay exponents for the two competing priorities. They control the rate at which scores increase or decrease in relation to their distance from the crime. The values for g and f can be determined empirically (if you happen to have access to a large database of crimes and locations) or they can be chosen to model potential types of crime patterns. George Rengert proposed four hypothetical spatial patterns that could be used to describe the geography of crime sites: (1) a uniform pattern with no distance-decay influence (if I'm understanding him correctly, this effectively removes the influence of distance on the results, i.e. g = f = 0, so every distance term equals 1); (2) a bull's-eye pattern with spatial clustering, exhibiting distance-decay centered around the offender's primary anchor point, which is similar to what we see here in Rossmo's formula; (3) a bimodal pattern with crime clusters centered around two anchor points; and (4) a teardrop pattern with a directional bias oriented toward a secondary anchor point (Rengert 1991). For my test I used the values f = 0.33 and g = 0.666666, which describe a slightly slower decay than the one in the patent. In the patent's description Rossmo uses 1.2 for both g and f, so we'll examine both configurations as we go and hopefully gain some insight into how these key parameters impact the results.

Now let's examine the numerator of the second fraction. In the case where the distance between the grid square being tested and the nth crime is less than or equal to the buffer radius (B) we calculated, the numerator becomes (1 - Φij) = 1 multiplied by the buffer radius raised to the power of g-f. For example's sake let's assume a buffer radius of 5, which means the numerator of the second fraction is 1 * (5^0.336666) = 1.72 using my settings, or 1 * (5^0) = 1 using Rossmo's referenced values. Already we can see a slight difference, as my numerator will cause the score to be slightly higher for the same denominator value. For example, 1/4 = 0.25 while 1.72/4 = 0.43. One thing to note: the result for a given grid square is not a traditional probability constrained to values between 0 and 1. The scores are more akin to relative probabilities, in that higher numbers indicate higher probability. We can scale the results to the range of 0 to 1, where 0 indicates the lowest observed score and 1 indicates the highest observed score, using a min-max scaler from Scikit-learn. Since the scores are relative, the difference between these two hypothetical results doesn't indicate anything significant. Moreover, you cannot directly compare two grid square results generated using different parameters, as the relative score ranges may be very different; scaling the scores before comparing them allows us to make an accurate comparison. For example, you cannot say the hypothetical grid square that scored 0.43 with my settings is more probable than the same grid square when calculated with Rossmo's settings, which scored 0.25. But if we observe that the score range with my settings is between 0 and 2.15, and with Rossmo's we observe a score range of 0 to 1.25, then we can scale the two results using min-max to get (0.43-0)/(2.15-0) = 0.2 and (0.25-0)/(1.25-0) = 0.2 respectively. We can now compare these two results and say the hypothetical grid square scored approximately the same given the two different settings (about 20% of the way up the score scale in each case).
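As a quick illustration, here's what that scaling step might look like with Scikit-learn's MinMaxScaler, using the hypothetical score ranges from above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw score columns produced under the two settings
scores_mine = np.array([[0.0], [0.43], [2.15]])    # my range: 0 .. 2.15
scores_rossmo = np.array([[0.0], [0.25], [1.25]])  # Rossmo's range: 0 .. 1.25

# MinMaxScaler maps each column to [0, 1] via (x - min) / (max - min)
print(MinMaxScaler().fit_transform(scores_mine).ravel())    # [0.  0.2 1. ]
print(MinMaxScaler().fit_transform(scores_rossmo).ravel())  # [0.  0.2 1. ]
```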

Next, let's examine the denominators of each fraction. In the case where the distance from the grid square to the nth crime is greater than the buffer radius we previously calculated (the left side fraction), we take the Manhattan distance between the two points and raise it to the f power. This means that as the distance between the two points increases, the value of the fraction decays as a power of the distance, which models the desire not to travel farther than necessary to commit a crime. For example, if the grid square is 3 miles from the nth crime location, the score would be 1/(3^0.33) = 0.696 using my test settings and 1/(3^1.2) = 0.268 for Rossmo's settings. If the distance is increased to 4 miles instead, the scores would be 1/(4^0.33) = 0.633 and 1/(4^1.2) = 0.189 respectively. In both instances, as the distance increases past the edge of the buffer zone, the score decreases as desired.

In the second case, where the distance between the grid square and the nth crime is less than or equal to the buffer radius we calculated (the right hand fraction), we subtract the Manhattan distance from the diameter of the buffer zone (2B means twice the radius. You remember geometry, don't you?) and then raise the result to the power of g. Let's recalculate our previous example using this formula and the example buffer radius of 5. If the distance between the points is 3 miles then the score would be 1.72/(10 - 3)^0.666666 = 0.47 using my test values and 1/(10 - 3)^1.2 = 0.097 using Rossmo's. If we increase the distance to 4 miles we expect the score to go up, since we are inside the buffer zone and the desire is to distance one's anchor point from the crimes by some ideal amount. The new results become 1.72/(10 - 4)^0.666666 = 0.521 for my test values and 1/(10 - 4)^1.2 = 0.116 for Rossmo's. Again, the formula behaves as expected and both numbers increase as the point approaches the edge of the buffer zone.
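Putting the pieces together, here's a minimal sketch of the per-square score. The function and parameter names are my own, and it reuses the manhattan helper from earlier:

```python
def rossmo_score(square, crimes, B, f=0.33, g=0.666666, k=1.0):
    """Relative score for one grid square under Rossmo's formula.

    square : (x, y) grid coordinates of the square being tested
    crimes : list of (x, y) grid coordinates of the crime sites
    B      : buffer radius (half the mean nearest neighbor distance)
    """
    total = 0.0
    for crime in crimes:
        d = manhattan(square, crime)
        if d > B:
            # Phi = 1: outside the buffer, the distance-decay term applies
            total += 1.0 / d**f
        else:
            # Phi = 0: inside the buffer, the score rises toward the edge
            total += B**(g - f) / (2*B - d)**g
    return k * total
```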

We continue calculating these values for every crime location and then sum the values to get a total score for the grid square. You could instead choose to multiply the scores, which has the effect of amplifying the differences between them. For example, if one square gets the three scores (3, 5, 2) and another square gets the scores (3, 3, 2), the additive method gives total scores of 10 and 8 respectively, a difference of 2 points. The product method gives 30 and 18 respectively, a difference of 12 points. According to the patent, either one works. The product method might be better when the data points are more closely clustered and the differences between them are smaller. Here I'll be using the additive method, as shown below.
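The two aggregation options side by side, using the example scores from above:

```python
import math

square_a = [3, 5, 2]
square_b = [3, 3, 2]

# Additive aggregation: a difference of 2 points
print(sum(square_a), sum(square_b))              # 10 8

# Product aggregation: the gap widens to 12 points
print(math.prod(square_a), math.prod(square_b))  # 30 18
```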

Now comes the actual analysis. Let's see an example of a heat map generated using this equation, my test settings, and a set of data points from a series of crimes in Atlanta, Georgia in the early 1980s that I pulled from the interwebs.

The black circles represent locations where bodies attributed to the same offender were recovered. The green star near the center of the map is where the offender's home was later identified. The color gradient runs from black for the lowest probability areas all the way up to white for the highest probability areas. We can examine the same data with Rossmo's settings as well.

As you can see, my lower values for f and g cause the probability values to decay more slowly, spread a bit farther, and create a more blended transition. Rossmo's settings create a much more defined transition between probability areas, and you can clearly see the effect of the buffer zone around each crime location. We'll analyze the results for the offender's location in more detail shortly, but for now let's just get a sense of how the algorithm might be applied in the real world. To do that we can look at the hot zones mapped over the geographic area, as an investigator might. First, using my test settings:


And again using Rossmo's reference values:

Mean Location Score

Alright, enough messing around with pretty pictures. Let's ask some serious questions. First, how do we know if the points in the crime set really indicate some relationship to an anchor point? What if these points are really just random? What would the expected score for the offender's location be, and is that significantly different from the score we observe with these points?

More formally, we define the first hypothesis as "Crime locations chosen by an offender will generate a score for their anchor point which is significantly higher than the mean score of the same location with randomly generated points." The null hypothesis is then "The score for the chosen points will not be significantly larger than the mean score for randomly generated points." We can define a threshold of 95% to denote a significant difference. This means the relative value of the chosen points needs to be roughly two standard deviations above the mean score for us to reject the null hypothesis.


$$Z = \frac{\operatorname{prob}(O_x, O_y) - \mu}{\sigma}$$

Here O stands for the offender's coordinates, and prob(Ox, Oy) is the score of the offender's location given the historically observed crime locations. The variable μ is the mean of the scores for the offender's location given random points, and σ is the standard deviation of those scores.

To answer the question I ran one thousand simulations on the data, in 10 groups of 100 sample simulations each. For each simulation I randomly generated the same number of points as in the original crime data (26) within the same bounding area as the observed data. I then calculated the probability grid using Rossmo's formula and the randomly generated crime sites. I recorded the score of the offender's location as well as some other statistics that we'll cover in a bit. Once each group of 100 simulations completed, I calculated the mean score for that sample and repeated the process until I had 10 mean scores. I then averaged these to get a close approximation of the population mean and standard deviation. The mean score for the offender's location when using random points was 7.45, or about 83.44% of the way up the score scale using my parameters. The standard deviation for the score was 0.285 points. Below you can see the expected score distribution.

On average, the offender's location score was in the 97th percentile of the simulation data, which means there is less than a 3% chance we'd observe a value as extreme as (or more extreme than) the observed score of 8.104 if the points were random.
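The simulation loop behind these numbers looked roughly like this. It's a sketch with hypothetical helper names (buffer_fn stands in for whatever computes the buffer radius), reusing the rossmo_score function sketched earlier:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility

def simulate_baseline(bounds, n_crimes, offender, buffer_fn, n_sims=1000):
    """Estimate the mean and standard deviation of the offender-location
    score when the crime sites are drawn uniformly at random."""
    xmin, xmax, ymin, ymax = bounds
    scores = np.empty(n_sims)
    for s in range(n_sims):
        # Draw a random crime set inside the observed bounding box
        xs = rng.uniform(xmin, xmax, n_crimes)
        ys = rng.uniform(ymin, ymax, n_crimes)
        crimes = list(zip(xs, ys))
        scores[s] = rossmo_score(offender, crimes, B=buffer_fn(crimes))
    return scores.mean(), scores.std()
```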

I repeated the test with Rossmo's settings to see if the result was similar.

In this case the average score expected with random points is around 0.34, or about 61% of the way up the scoring scale. The offender's location scored in the 98th percentile on average, meaning there is less than a 2% chance that we'd observe a value as extreme as the observed score of 0.506 with randomly selected points.

In both tests the results indicate the algorithm scored the offender's location significantly higher than average, given the observed crime locations. We can conclude a score that high isn't likely to occur if the points are chosen at random, and Rossmo's formula appears to work on this test case. Of course, this result only represents one test, and I'd be interested to see how these results hold up over a larger sample size.

Hottest Zones

Another statistic I examined was the value of the highest score on the map. I'm curious if the value of the hottest zone's score can be used to indicate the presence of a non-random pattern. If there is a significant change in the highest score when using the observed points versus random points, this may be interesting for training a predictor to recognize when crime locations are likely to be related to one another.

With my settings the hottest zone on the map scored around 7.92 on average, with a standard deviation of only 0.26 points, when using randomly selected points. With Rossmo's values I saw an average high score of about 0.54 and a standard deviation of 0.068 points (about 12.6%). With the historical points, the map's hottest zone scored 8.41 with my settings and 0.698 with Rossmo's. In both cases we see an increase in the score assigned to the hottest zone; with my settings the increase is just under 2 standard deviations, and with Rossmo's it's just over 2 standard deviations. So the results here seem to be a mixed bag, and we can't draw a conclusion about the impact of point selection on the hottest zone measurement.

Grid Resolution 

One area of the algorithm that we need to discuss is the impact of grid resolution on the results. The formula is continuous: it can calculate a score for any arbitrary point. Our systems, on the other hand, are digital and therefore limited. For example, the smallest unit your screen can produce is 1 pixel, Python floats carry only about 15-17 significant digits of precision, and so on. To handle this in practice we create an imaginary grid over the area, as I mentioned before. The density of the grid, that is, the number of possible grid coordinates over the same geographic area, is a function of the minimum and maximum coordinates in the data and a divisor we choose. In my sample data, the points spanned an area of approximately 664.07 square miles (15.689 miles * 42.328 miles) and I chose a grid size of 100 by 100. This means each grid square covers 0.066 square miles, or in my test implementation each score assignment covers roughly 27 city blocks. On the positive side, this means we only need to calculate (100^2)*n scores and we can produce a result very quickly. On the down side, an area that size in an urban sprawl like Atlanta may include hundreds of individual addresses for businesses, homes, apartment buildings, etc., and there is no way to prioritize them within the grid square. If we increase our resolution to 1000x1000, we cut each tile down to well under a single city block (about 0.27 blocks), but we increase the number of calculations to (1000^2)*n as well. In general I'd think you want the highest resolution you can afford, but there is a limit on the accuracy of the heuristic, so perhaps a few city blocks at a time is a good resolution. Remember, the goal of the formula isn't to pinpoint a location, but to help prioritize areas to investigate, and there are likely diminishing returns beyond a resolution of a few city blocks.
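For reference, the cell-coverage arithmetic is simple enough to script; the spans below are the ones from my test data:

```python
# Bounding-box spans from my test data (see above)
span_x_miles = 15.689
span_y_miles = 42.328

for divisor in (100, 1000):
    cell_area = (span_x_miles / divisor) * (span_y_miles / divisor)
    print(f"{divisor}x{divisor}: {cell_area:.5f} sq mi per cell, "
          f"{divisor**2:,} cells to score per crime")
```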

The other factor which contributes to grid resolution is the geographic area spanned by the crime sites. The same grid resolution of 100x100 over an area of 1300 square miles would roughly double the geographic area covered by a single tile. Said differently, you must use a higher-density grid (such as 1000x1000) to maintain the same geographic coverage per tile as the total area increases.

All this to say, there is no de facto 'best' grid resolution; the analysis has to inform the choice, and an analyst must apply some amount of domain knowledge when configuring this step.

Other Statistics

In a related paper (https://gis.smumn.edu/GradProjects/NeldnerR.pdf) Rachel Neldner used Rossmo's formula to analyze another data set. The paper gives really good background on the historically observed geographic relationship between criminals and their crime locations. The paper also references a taxonomy for classifying the type of criminal behavior which may help inform the geographic profile. Overall, I'd say it's definitely worth checking out if you're interested in a deeper analysis of the formula.

The paper also details several other statistics which can be useful during an analysis, like the center of minimum distance (CMD), mean center (MC), and mean nearest neighbor distance (MNND). These statistics aren't strictly part of Rossmo's formula (with the exception of the mean nearest neighbor distance, which is used to find the buffer radius), but they have been used by researchers for a long time and are generally accepted measures. Let's go over each in a bit more detail and discuss the results with the test data.

Mean Nearest Neighbor Distance

The mean nearest neighbor distance is a simple metric to calculate, but it has been shown to be quite powerful in analyzing the relationship between crime locations. To calculate the metric we iterate over each point in the crime data. For each point we find the other point in the data which is closest to it, using the Manhattan distance measure. We then take the mean of these minimum distances to find the MNND. The mean nearest neighbor distance can be thought of as a measure of how randomly dispersed the crime sites are. It's well studied that people are horrible randomness generators, and what we think of as 'random' is really closer to 'evenly distributed'. We can infer, then, that when an offender picks seemingly random points, the points will likely end up more dispersed and therefore have a significantly higher MNND than a truly random set of points. Conversely, if the crime sites exhibit clustering, we can expect to see an MNND that is significantly lower than the range observed with random points. The question then is: what is the expected range for a random MNND? Once we answer that, we can check if our data falls within this range.
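Here's a minimal sketch of that calculation, reusing the manhattan helper from earlier:

```python
def mean_nearest_neighbor_distance(points):
    """Mean, over all points, of the Manhattan distance to each
    point's nearest neighbor in the set (assumes at least 2 points)."""
    nearest = []
    for i, p in enumerate(points):
        nearest.append(min(manhattan(p, q)
                           for j, q in enumerate(points) if j != i))
    return sum(nearest) / len(nearest)

# Rossmo's buffer radius is half the MNND of the crime sites:
# B = mean_nearest_neighbor_distance(crimes) / 2
```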

During the previous tests I also calculated the MNND for each set of randomly generated points to estimate its probability density function, which is graphed below.

Mean Nearest Neighbor Distance Probability Density Function

My test showed the mean nearest neighbor distance was just under 3.48 miles with a standard deviation of 0.027 miles. I chose an interval of 95% which means the lower bound would be around 3.43 miles. Anything lower than that and we can say the points are tending toward clustering. The upper bound is just under 3.54 miles so a distance greater than this indicates the points are tending toward being evenly dispersed. The test data points have an MNND of 2.87 miles so we can conclude these points are showing considerably more clustering than we'd expect to see from randomly selected points.

Center of Minimum Distance and Mean Center

These are two closely related statistics which find a central point relative to the observed points. The simplest descriptor of a distribution is the mean center. The mean center of a group of points can be thought of as the point where both the sum of all differences between the mean X coordinate and all other X coordinates is zero, and the sum of all differences between the mean Y coordinate and all other Y coordinates is zero (Levine and Associates). According to Neldner and earlier researchers, the offender's anchor point is often very near the mean center (Neldner; Paynich and Hill, 2010). During the simulation I calculated the mean center for each set of randomly generated points and then measured the distance to the offender's location. I wanted to see if the offender's distance from the mean center of the actual crime sites was significantly different from the values one could expect to see with random points. Below is another probability density function, calculated on the 1000 randomly generated mean center samples.
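The mean center itself is just the coordinate-wise average, so a sketch is trivial:

```python
def mean_center(points):
    """Mean center: the coordinate-wise average of the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# Distance from the offender's anchor point to the mean center:
# manhattan(offender, mean_center(crimes))
```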

Distance to Mean Center Probability Density Function

As you can see, the observed score (the yellow vertical line in the middle) is very close to the mean score observed with random points. So, at least in this case, the distance to the mean center is no different than if the points had been selected at random over the same geographic region.

The center of minimum distance is the location on the map which minimizes the Manhattan distance to all the crime sites. According to the paper:

the CMD of crimes committed by a single serial offender will usually be the best single predictor of where the offender lives (Levine and Associates). The CMD is another indicator of the offender’s anchor point (Paynich and Hill, 2010).
We can repeat the same analysis as we performed on the mean center data, only this time we care only about whether the distance falls below our lower bound.
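One convenient property of the Manhattan metric is that it's separable, so the coordinate-wise median minimizes the summed distance to all points. That gives a quick way to compute the CMD (with Euclidean distance it would require an iterative search):

```python
import statistics

def center_of_minimum_distance(points):
    """CMD under the Manhattan metric: the coordinate-wise median
    minimizes the total |dx| + |dy| distance to all points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (statistics.median(xs), statistics.median(ys))
```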

Distance to Center of Minimum Distance Probability Density Function

The result clearly shows the offender lived much closer to the center of minimum distance of the actual crime sites than one would expect to see if the points were randomly generated. We'd expect the distance to fall somewhere between 5.7 and 7.2 miles, but the CMD in this case was only 4.12 miles away from the offender's actual location.

Final Thoughts

This has been a fun journey through one of the most famous geographic profiling formulas in the world. We've examined how the probability map is calculated and seen an example of how one may apply it to a series of crime locations. Of course, this experiment only examined one case, using randomness as our baseline predictor, so any conclusions we draw can only be thought of in the context of this data set. Does Rossmo's formula increase the score of the offender's location in a significant way? In this case, it certainly appears so. For other cases, we can't say. With access to more data sets we may be able to expand on the concepts presented here and improve the generality of our assessment.

So what is the future of a system like this? The modern state of the art is already well ahead of the simplified analysis performed here, combining improvements in machine learning algorithms with advancements in behavioral analysis. One could combine multiple statistical indicators like the MNND and CMD with discriminant analysis (LDA and QDA) and similarity measures (like Jaccard's index or the improved taxonomic similarity Δs) to build a model that classifies the type of offender most likely to generate an observed set of crimes, then use this classification to improve the distance-decay coefficients based on empirically observed values for offenders of similar types.

Similarity analysis can also be applied to the crime locations themselves, in an attempt to identify areas an offender may choose in the future based on their similarity to previously chosen sites. Combining other types of geographic information, such as the distance to major roads, businesses, or accessibility by vehicle, can improve the chances of finding meaningful similarities between crime locations as well.

Of course, all of that relies on having a much larger set of crime data to work from. I'd love to run these tests against a large number of disparate cases to see how this analysis performs, but it is hard to find verifiable data sets which contain the geographic information required.


