Benford's Law and More Statistics [UPDATED]

Inspired by a post I saw on Reddit, I decided to analyze the 2008 NH Primary Election results using Benford's Law.

Benford's Law basically states that if you take a data set, examine the first digit of each number, and tally all the 1's, 2's, 3's, and so forth, you will see far more 1's, 2's and 3's than 7's, 8's and 9's, as shown in the chart/graph below:

Chart courtesy of Journal Of Accountancy and graph courtesy of FiveThirtyEight
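
The expected frequencies in the chart come from the standard Benford formula, P(d) = log10(1 + 1/d) for a leading digit d. Here is a minimal sketch of how you could compute them and tally first digits yourself. The function names and the sample numbers are made up for illustration; they are not taken from the NH spreadsheet.

```python
import math
from collections import Counter

def benford_expected():
    """Expected first-digit probabilities: P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit_counts(values):
    """Tally the leading digit of each positive integer."""
    counts = Counter(int(str(v)[0]) for v in values if v > 0)
    return {d: counts.get(d, 0) for d in range(1, 10)}

expected = benford_expected()

# Made-up vote totals, purely for illustration.
sample = [123, 187, 1450, 96, 210, 34, 1112, 178, 305, 19]
observed = first_digit_counts(sample)

for d in range(1, 10):
    # digit, observed count, expected percentage under Benford's Law
    print(d, observed[d], round(expected[d] * 100, 1))
```

With only ten numbers the observed counts will be noisy; the comparison only becomes meaningful with a few hundred values or more.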

Benford's Law is a legitimate tool for detecting fraud and other anomalies in data. You can read a great article about the topic here.

A related property concerns the last two digits: for numbers >= 100, each of the 100 possible endings (00, 01, 02 ... 70, 71, 72 ... 97, 98, 99) should appear with a probability of about 1%. The more digits a number has, the closer the probability gets to 1% (e.g. the last two digits of a 10-digit number will be closer to evenly distributed than those of a 3-digit number, but the probabilities are nearly the same).
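
A rough sketch of the last-two-digits tally, assuming the vote totals are plain integers. The helper name `last_two_digit_shares` and the sample data are hypothetical; the sample is constructed so that every ending occurs equally often.

```python
from collections import Counter

def last_two_digit_shares(values):
    """Share of each ending 00..99 among values that are >= 100."""
    eligible = [v for v in values if v >= 100]
    if not eligible:
        return {}
    counts = Counter(v % 100 for v in eligible)
    n = len(eligible)
    return {ending: counts.get(ending, 0) / n for ending in range(100)}

# Artificial sample: every integer from 100 to 1099, so each ending
# appears the same number of times.
sample = list(range(100, 1100))
shares = last_two_digit_shares(sample)

# Endings whose share is noticeably above the 1% expectation.
suspicious = [e for e, s in shares.items() if s > 0.02]
print(suspicious)
```

Real data will never be this even, which is why the comparison needs some tolerance band around 1% rather than an exact match.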

So how does this help us to detect fraud?

Psychologists have found that humans have trouble generating random numbers. Humans tend to repeat digits and have trouble selecting non-adjacent digits (such as 64 or 17, as opposed to 23) as frequently as one would expect in a sequence of random numbers.

Using data available from http://www.sos.nh.gov/presprim2008/index.htm, I tabulated the results for Edwards, Obama, Richardson, and Clinton for all NH counties except Coos County. For Rockingham and Hillsborough counties, I tabulated the results before and after the recount (these were the only ones that had Democratic recount data). Results are posted below. You can download the attached spreadsheet for the dataset and more graphs:

With the exception of Obama and Richardson, all data points fall between the Lower and Upper Control Limits.


Looking at the last 2 digits, we see that nearly every data point falls within the control limits.

So what does this all tell us?

To be clear, points that lie outside the control limits do not signify fraud in and of themselves. What they do signify, however, is that we have encountered a "special case" scenario that warrants further investigation; that is, the resulting variance is greater than that of random "noise".
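
For readers who want to reproduce this kind of check: the exact formula behind the control limits is not spelled out in this post, so the sketch below assumes one common choice, three-sigma limits for a binomial proportion (as in a p-chart). The function names, the sample size of 200, and the observed shares are all hypothetical.

```python
import math

def control_limits(p, n, k=3.0):
    """Assumed k-sigma limits for an expected proportion p over n values."""
    sigma = math.sqrt(p * (1 - p) / n)
    return max(0.0, p - k * sigma), min(1.0, p + k * sigma)

def out_of_control(observed_share, p, n, k=3.0):
    """True if the observed share falls outside the assumed limits."""
    lower, upper = control_limits(p, n, k)
    return not (lower <= observed_share <= upper)

# Expected share of leading 1's under Benford's Law.
p_one = math.log10(2)

# Hypothetical: 200 vote totals, with 30% or 45% of them starting with 1.
print(control_limits(p_one, 200))
print(out_of_control(0.30, p_one, 200))
print(out_of_control(0.45, p_one, 200))
```

Note that the limits widen as the number of values shrinks, so a candidate with few reported totals can wander further from the expected share without being flagged.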

Benford's Law fits best when the data follow a power-law distribution with no theoretical upper limit. The problem in our case is that US precincts are often divided by number of voters; once a precinct gets too large, a new one is formed so voters do not have to stand in line forever.
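
To see why an upper limit matters, here is a small illustration with two artificial data sets (neither is election data): powers of 2, which span hundreds of orders of magnitude, and the integers 1 to 500, which stand in for totals with a hard ceiling. The ceiling of 500 is an arbitrary assumption chosen only to make the effect visible.

```python
from collections import Counter

def digit_shares(values):
    """Share of each leading digit 1..9 among positive integers."""
    counts = Counter(int(str(v)[0]) for v in values)
    n = len(values)
    return {d: counts.get(d, 0) / n for d in range(1, 10)}

# Artificial data with no practical ceiling: 2^1 .. 2^1000.
unbounded = [2 ** k for k in range(1, 1001)]

# Artificial data with a hard ceiling at 500.
capped = list(range(1, 501))

wide_shares = digit_shares(unbounded)
capped_shares = digit_shares(capped)

for d in range(1, 10):
    # digit, share in the unbounded set, share in the capped set
    print(d, round(wide_shares[d], 3), round(capped_shares[d], 3))
```

The powers of 2 track the Benford curve closely, while the capped set gives digits 1 through 4 nearly equal shares and almost nothing to 5 through 9. Real precinct totals sit somewhere between these two extremes.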

If we create a histogram of the vote totals*, we end up with a graph that looks similar to a power-law distribution, but not exactly.

*Vote totals in chart are equal to total votes for Obama, Clinton, Edwards, and Richardson, which make up ~95% of the actual vote total.

Attachment: nh results_06.xls (732.5 KB)

Comments

And now...

How about doing the same for the Bush v Gore results in Florida? Running this on the originally reported results would be great.

Thanks for the reply. I may do this in the future, but I think it has already been done using the second digit for the test. See this article.

From the report:

The test does not indicate problems for Florida in 2000. Regarding Ohio in 2004, the test does not overturn previous judgments that manipulation of reported vote totals did not determine the election outcome, but it does suggest there were significant problems in the state.