Why Benford’s Law Works and How to do Digit Analysis on Spreadsheets

by Stephen J. Huxley

McLaren School of Business

University of San Francisco

 


Abstract

Digit analysis is a new weapon in the arsenal of auditors and others seeking to verify the authenticity of a list of numbers. These numbers may represent transactions, income statement figures, or stock market movements. Benford’s Law states that the first digits in any list of numbers that have not been generated by some systematic process follow a logarithmic rather than a uniform pattern, with 1’s accounting for about 30 percent of all leading digits, 2’s about 18 percent, and so forth, declining to about five percent for 9’s. This paper provides several explanations of why Benford’s Law works and extends it to other forms of digit analysis.

 

Introduction

What is Digit Analysis?

Benford's Law is named after Frank Benford, a physicist working for GE who in 1938 [1] was the first to popularize the discovery that leading digits do not follow a uniform distribution, as intuition would suggest. Decades later, in 1988, Carslaw [3] applied Benford’s Law to a set of financial statements and discovered evidence of misstatements. Since Carslaw, Benford’s Law and its extensions have blossomed into the latest tool for auditors trying to detect evidence of fraud or irregularities in accounting systems [7], for investment analysts searching for patterns in stock market returns [5], and even for religious scholars attempting to validate biblical accounts of events that happened thousands of years ago [9]. Nigrini and Mittermaier [7] coined the name “digit analysis” and defined it as any analytical procedure that examines digit and number patterns to detect abnormal recurrences of digits, digit combinations, or specific numbers. Expected patterns have now been worked out for: 1) first digits, 2) second digits, 3) first two digits, 4) last two digits, 5) rounded digits, and 6) duplicates.

 

What is Benford’s Law?

Benford’s “Law of Anomalous Numbers” states that the distribution of first digits of numbers generated by no known systematic process follows a logarithmic pattern:

 

\[ P(\text{1st digit} = n) = \log_{10}\left(1 + \frac{1}{n}\right), \quad n = 1, 2, \dots, 9 \qquad (1) \]

 

This function leads to expected frequencies that decline monotonically from about 30 percent for 1’s down to about five percent for 9’s. Similar functions have now been developed for the distributions of the second through fourth digits, but the distribution flattens out to uniformity for the fifth and later digits (see Table 1).


 

 

 

 

 

 

Table 1

Distribution of Digits

Digit     1st       2nd       3rd       4th       5th or Greater
0         -         11.97%    10.18%    10.02%    10.00%
1         30.10%    11.39%    10.14%    10.01%    10.00%
2         17.61%    10.88%    10.10%    10.01%    10.00%
3         12.49%    10.43%    10.06%    10.01%    10.00%
4         9.69%     10.03%    10.02%    10.00%    10.00%
5         7.92%     9.67%     9.98%     10.00%    10.00%
6         6.69%     9.34%     9.94%     9.99%     10.00%
7         5.80%     9.04%     9.90%     9.99%     10.00%
8         5.12%     8.76%     9.86%     9.99%     10.00%
9         4.58%     8.50%     9.83%     9.98%     10.00%
Total     100.00%   100.00%   100.00%   100.00%   100.00%
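
The percentages in Table 1 can be reproduced from equation (1) and its standard extension to later digits, in which the probability of a digit d in a given position is the sum of log10(1 + 1/(10m + d)) over every possible block m of preceding digits. The following Python sketch is illustrative only; the function name digit_probability is an assumption of this example and does not come from the cited literature.

```python
import math

def digit_probability(digit, position):
    """Benford probability that the digit in `position` (1 = first) equals `digit`."""
    if position == 1:
        # Equation (1); a leading digit can never be zero.
        return math.log10(1 + 1 / digit) if digit != 0 else 0.0
    # Later positions: sum over every possible block of preceding digits.
    first_block = 10 ** (position - 2)
    last_block = 10 ** (position - 1)
    return sum(math.log10(1 + 1 / (10 * block + digit))
               for block in range(first_block, last_block))

if __name__ == "__main__":
    # One row per digit, one column per position (1st through 4th).
    for digit in range(10):
        row = [digit_probability(digit, pos) for pos in (1, 2, 3, 4)]
        print(digit, "  ".join(f"{100 * p:6.2f}%" for p in row))
```

Each column sums to 100 percent because the logarithms telescope, and the columns move toward 10 percent per digit as the position increases.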


Why Benford’s Law Works

A Calculus Explanation

If we were to pick any number in the finite range from 1 to x, the chances of selecting any specific number would be 1/x.  For the numbers from 1 to 999, the chances of picking any particular number would be 1/999.  Any list of "real world" numbers that have not been generated artificially by some process represents a random sample of numbers between 0 and the largest number on the list, x.  The checks written by a company would represent an example of such a list. 

 

If 1/x is treated as a density function, the area underneath it between any two points a and b gives the relative likelihood of getting a value lying between a and b. This is the same procedure, of course, as calculating probabilities for normally distributed random variables, only in this case the density function is simply 1/x rather than the complicated mathematical equation discovered by Gauss that describes the normal curve. From integral calculus, we know that the area under the distribution ‘curve’ between a and b is derived as

\[ \int_a^b \frac{1}{x}\,dx = \ln(b) - \ln(a) = \ln(b/a) \qquad (2) \]

 

Proving this is beyond the scope of this paper; most mathematicians treat it as a definition, but Feller [4] attempts a proof. For single-digit numbers, if a = n and b = the next digit n+1, then the area associated with the digit n is ln[(n+1)/n] = ln(1+1/n). Raimi [8] shows that because natural logs are simply a scalar multiple of base-10 logs (i.e., ln(a) = ln(10) * log(a) = 2.3026 log(a)), percentages will be the same whether ln or log is used. (Each area must first be divided by the total area from 1 to 10, which is ln(10), so that the nine percentages sum to 100.) Benford uses logs to the base 10 rather than natural logs to the base e because we use a numbering system based on 10. Raimi provides an excellent review and discussion of the mathematical treatment of the problem.
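
The step from natural logs to equation (1) can be written out explicitly. This is a sketch under the assumption, made above, that values are spread with density proportional to 1/x; it is restricted to a single decade from 1 to 10 so that the total area is finite.

```latex
\[
P(\text{first digit} = n)
  = \frac{\int_{n}^{n+1} \frac{1}{x}\,dx}{\int_{1}^{10} \frac{1}{x}\,dx}
  = \frac{\ln(1 + 1/n)}{\ln 10}
  = \log_{10}\!\left(1 + \frac{1}{n}\right),
  \qquad n = 1, \dots, 9 .
\]
```

For n = 1 this gives ln(2)/ln(10) = 0.6931/2.3026 = 0.3010, the 30.10 percent shown for the first digit 1 in Table 1.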

 

Cards in the Hat Explanation  

Perhaps the easiest explanation of why probabilities decline from one digit to the next is provided by Weaver [10]. Assume we write numbers on cards starting with 1 and ending with 999,999, each number getting a separate card. Let P = the probability of getting 1, 2, 3, or 4 as the leading digit. Intuition suggests P = 4/9.

 

As we number the individual cards, we begin to place them one at a time into a hat in ascending order starting with 1.  After each card, we ask the question “What is the probability, P, that a card picked at random from the hat at this point in time will have a leading digit of 1, 2, 3, or 4?”  The answer would be P = 100 percent for the first four cards, of course.  After the fifth card, P would drop to 80 percent.  After the sixth card, it would drop to 66.7 percent.  After the ninth card, it would drop to 4/9 or 44.4 percent, our overall intuitive level based on the entire batch.

 

After the tenth card, however, P would rise to 50 percent since five of the ten cards have a leading digit of 1 through 4. Through 19, P rises to 73.7 percent since 11 of the leading digits would be 1, and the initial 2, 3, and 4 are still in the hat, making 14 of the 19 cards winners. As cards 20 through 49 are added, P increases steadily to a maximum of 44/49 or 89.80 percent.

 

As more cards are added beginning with 50, P then declines steadily, reaching a minimum at the 99th card of 44/99 or 44.4 percent again. With the addition of the 100th card, P begins to rise once more, and will continue to do so until 499 is reached. At that point, 444 of the 499 cards, or 88.98 percent, meet the criterion. At 500, P will begin to fall, ultimately back to 444/999 or about 44.4 percent again.
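
The rise and fall of P described above can be checked by brute force. The following Python sketch simply counts cards; the function name share_low_leading is an assumption of this example, not part of Weaver's presentation.

```python
def share_low_leading(max_card, low_digits="1234"):
    """Fraction of cards 1..max_card whose leading digit is one of low_digits."""
    hits = sum(1 for card in range(1, max_card + 1)
               if str(card)[0] in low_digits)
    return hits / max_card

if __name__ == "__main__":
    # Turning points mentioned in the text, plus a few cards in between.
    for n in (4, 5, 6, 9, 10, 19, 49, 99, 499, 999):
        print(f"after card {n:>3}: P = {100 * share_low_leading(n):.2f}%")
```

At every stopping point P stays at or above 4/9, which is the sense in which the low digits never lose their lead.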

 

A key observation here is that each time we pass a turning point in our calculations, the length or span of cards we must cover to reach the next turning point gets longer and longer in real terms. To reach 10, it took only five additional cards after 4. To reach 50, it took 40; to reach 500, it took 400, etc. To reach 500,000 it took 400,000 additional cards. If the maximum check written is a six-digit check, it will have to be over $500,000 to begin to offset the lead that the lower digits have built up to that point. Because of the way we count, 1, 2, 3, and 4 will always be in the lead. That is, their probabilities will always be higher unless the number of checks over $500,000 is greater than the number below $500,000, which would be rare. Therefore, the lower digits 1, 2, 3, and 4 will nearly always have a higher probability of occurring as leading digits than 5 to 9.

 

Working Backwards Explanation

Another intuitive explanation is to consider the largest number in the data set we are dealing with and work backwards. Assume the numbers in the data set represent checks written against an account, and consider the largest check that a company writes. Few companies write many checks in the seven-digit range, that is, over $1 million, but if a check is in the seven-digit range, it seems reasonable that it is more likely to be close to $1 million than to $9,999,999. The same would be true for six-digit checks (probably more are closer to $100,000 than to $999,999). As we work our way backwards, the same holds true (but probably not to the same extent) for five-digit checks, and even four-digit checks. For any digit range in which it is plausible that smaller checks are more likely than larger checks, we would expect to see more 1’s, 2’s, 3’s, 4’s, and 5’s as leading digits than 6’s, 7’s, 8’s, or 9’s.

 

It is important to understand the elements of this explanation: 1) leading digits inherently involve sequencing because 1 is smaller than 2, 2 is smaller than 3, etc.; 2) because of the way we count, 1 always gets a head start in terms of being the leading digit whether we are dealing with two-digit numbers, three-digit numbers, four-digit numbers, etc.; 2 gets to start next, then 3, and so forth; 3) the expected value of the largest number we are dealing with in any finite table will lie in the center of the ending digit range; and 4) in the real world, we are always using numbers that have a minuscule finite digit length compared to the entire range of possible numbers from zero to infinity.

 

Taken together, these facts lead to the conclusion that the leading significant digits we use are more likely to be 1 than 2, 2 more likely than 3, etc. 

 

Digit Analysis on Spreadsheets

The foregoing explains why Benford’s Law works for the first digit. Similar explanations can be made for the second, third, and other digits, but the logarithmic effect is diluted each time. Discernible deviations from a uniform distribution begin to disappear rapidly after the second digit, as indicated in Table 1. Nigrini and Mittermaier [7] provide details for the second-digit and first-two-digits distributions.

 

Armed with a knowledge of Benford’s Law, anyone can check a column of numbers in Excel by making use of the MID worksheet function. This function slices one or more characters off a cell entry, starting from any position specified. Assume the number 975 appears in cell B8, for example. The formula =MID(B8,2,1) returns the digit 7: B8 identifies the cell, the 2 says to start with the second character, and the 1 says to return one character. Other examples: =MID(B8,1,1) returns the digit 9; =MID(B8,3,1) returns the digit 5; and =MID(B8,1,2) returns the two digits 97.
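
For readers who prefer a script to a worksheet, the same slicing can be sketched in Python. Both function names below are assumptions of this example: mid imitates the 1-based behavior of the worksheet function, and significant_digits is a hypothetical helper that strips the sign, the decimal point, and any leading zeros first. The sketch assumes each value prints in plain decimal form, not in scientific notation.

```python
def mid(text, start, length):
    """Imitate the worksheet MID function: 1-based start, `length` characters."""
    begin = start - 1
    return str(text)[begin:begin + length]

def significant_digits(value):
    """Digits of a number with sign, decimal point, and leading zeros removed."""
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    return digits.lstrip("0")

if __name__ == "__main__":
    entry = significant_digits(975)
    print(mid(entry, 2, 1))  # second digit
    print(mid(entry, 1, 1))  # first digit
    print(mid(entry, 3, 1))  # third digit
    print(mid(entry, 1, 2))  # first two digits
```

Stripping leading zeros matters for amounts below one: the first significant digit of 0.05 is 5, not 0.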

 

Once the numbers have been parsed into their respective digits, the rest of the analysis follows the normal pattern for statistical significance testing: the observed digit frequencies are compared with the expected frequencies in Table 1. The chi-square, t, and Kolmogorov-Smirnov tests are the most commonly applied.
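
As an illustration of the first of these tests, the following Python sketch computes a chi-square statistic for first digits against equation (1). The function names and the two sample data sets are assumptions of this example. The critical value 15.507 is the standard 5 percent point of the chi-square distribution with 8 degrees of freedom (nine digits minus one).

```python
import math
from collections import Counter

CRITICAL_5_PERCENT_DF8 = 15.507  # chi-square critical value, 8 degrees of freedom

def first_digit(value):
    """Leading nonzero digit of a nonzero number printed in plain decimal form."""
    for ch in str(value):
        if ch in "123456789":
            return int(ch)
    raise ValueError(f"no nonzero digit in {value!r}")

def chi_square_first_digits(values):
    """Chi-square statistic comparing observed first digits with equation (1)."""
    counts = Counter(first_digit(v) for v in values)
    total = sum(counts.values())
    statistic = 0.0
    for digit in range(1, 10):
        expected = total * math.log10(1 + 1 / digit)
        statistic += (counts[digit] - expected) ** 2 / expected
    return statistic

if __name__ == "__main__":
    samples = {
        "powers of 2": [2 ** k for k in range(1, 1001)],  # known to track Benford
        "100 to 999": list(range(100, 1000)),             # uniform first digits
    }
    for name, data in samples.items():
        stat = chi_square_first_digits(data)
        verdict = "reject" if stat > CRITICAL_5_PERCENT_DF8 else "do not reject"
        print(f"{name}: chi-square = {stat:.2f} -> {verdict} Benford at 5 percent")
```

A statistic above the critical value signals that the list departs from the Benford pattern; it does not by itself say why.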

 

Rounded Ending Numbers

Rounding is checked by simply comparing the actual frequencies of numbers that are multiples of 10, 25, 100, and 1000 to their theoretical expectancies (.10, .04, .01, and .001, respectively). Rounding often indicates estimation, which may be inappropriate for taxable items such as sales or inventory counts.
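
A minimal Python sketch of this comparison follows. It assumes the amounts are whole numbers (for example, whole dollars), and the function name rounding_rates is an assumption of this example.

```python
def rounding_rates(amounts, bases=(10, 25, 100, 1000)):
    """Observed and expected share of whole-number amounts divisible by each base."""
    amounts = list(amounts)
    report = {}
    for base in bases:
        observed = sum(1 for a in amounts if a % base == 0) / len(amounts)
        report[base] = (observed, 1 / base)  # (actual share, theoretical share)
    return report

if __name__ == "__main__":
    # The integers 1 to 1000 contain no excess rounding, so the shares agree.
    for base, (observed, expected) in rounding_rates(range(1, 1001)).items():
        print(f"multiples of {base:>4}: observed {observed:.3f}, expected {expected:.3f}")
```

An observed share well above its expected share for a base suggests that amounts are being rounded to that base.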

 

Analysis of the last two digits proceeds in a similar manner. Numbers containing four or more digits should have the last two follow a uniform distribution of one percent each. Final digits in Excel are captured by sorting the original data into a descending array, then using the MID worksheet function as explained earlier for the various digit ranges separately.
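
Outside a spreadsheet, the last two digits can be taken with a remainder instead of the sort-then-MID approach described above. The Python sketch below swaps in that technique; it assumes whole-number amounts, requires at least one amount of four or more digits, and uses the hypothetical function name last_two_digit_shares.

```python
from collections import Counter

def last_two_digit_shares(amounts):
    """Share of each ending 00..99 among whole-number amounts of four or more digits."""
    endings = [abs(a) % 100 for a in amounts if abs(a) >= 1000]
    counts = Counter(endings)
    return {ending: counts[ending] / len(endings) for ending in range(100)}

if __name__ == "__main__":
    shares = last_two_digit_shares(range(1000, 2000))
    # Report the endings whose share differs most from the expected one percent.
    worst = sorted(shares, key=lambda e: abs(shares[e] - 0.01), reverse=True)[:3]
    for ending in worst:
        print(f"ending {ending:02d}: {100 * shares[ending]:.2f}%")
```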

 

Duplicates

Duplicate numbers in Excel are found by simply sorting the column of numbers first, then subtracting each entry from the one that follows it. Zero differences mean duplicate entries.
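
The sort-and-subtract procedure translates directly into a short script. In this Python sketch the function name find_duplicates is an assumption of this example, and each repeated value is reported once however many times it recurs.

```python
def find_duplicates(values):
    """Return the values that repeat, found by sorting and differencing neighbors."""
    ordered = sorted(values)
    duplicates = []
    for previous, current in zip(ordered, ordered[1:]):
        is_repeat = current - previous == 0
        already_listed = bool(duplicates) and duplicates[-1] == current
        if is_repeat and not already_listed:
            duplicates.append(current)
    return duplicates

if __name__ == "__main__":
    checks = [1250, 310, 1250, 4999, 310, 310, 87]
    print(find_duplicates(checks))
```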

 

The simplest way to estimate the expected number of duplicates is based on the well-known birthday problem. If you have 100 people in a room, how many matching birthdays do you expect to get? The most common solution is to assume a uniform distribution, so that the probability of any single pair of people having identical birthdays is 1/365. How many different ways can you pair up the 100 people? That is, how many different combinations of 2 can you make from n choices? The answer is n!/[2!(n-2)!], which for n = 100 is 4,950. Multiplying 4,950 possible pairs by 1/365 yields 13.56 as the expected number of matching birthdays.
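
The arithmetic can be reproduced in a few lines of Python. The function name expected_matching_pairs is an assumption of this example, and the calculation keeps the same uniform-distribution assumption as the text.

```python
import math

def expected_matching_pairs(items, categories):
    """Expected number of matching pairs among `items` spread uniformly over `categories`."""
    pairs = math.comb(items, 2)  # n! / [2! (n - 2)!]
    return pairs / categories

if __name__ == "__main__":
    print(math.comb(100, 2))                           # number of possible pairs
    print(round(expected_matching_pairs(100, 365), 2)) # expected matching birthdays
```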

 

References

1)       Benford, Frank (1938) “The Law of Anomalous Numbers,” Proceedings of the American Philosophical Society 78, 551-572.

2)       Berton, Lee (1995) “He’s Got Their Number:  Scholar Uses Math to Foil Financial Fraud,” The Wall Street Journal, July 10, 1995.

3)       Carslaw, Charles (1988) “Anomalies in Income Numbers:  Evidence of Goal Oriented Behavior,” The Accounting Review LXIII, No. 2, 321-327.

4)       Feller, William (1966) An Introduction to Probability Theory and Its Applications, Volume 2, Wiley, New York, 61-62.

5)       Ley, Eduardo (1996) "On the Peculiar Distribution of the U.S. Stock Indexes' Digits," The American Statistician 50, No. 4, 311-313.

6)       Newcomb, Simon (1881) “Note on the Frequency of Use of the Different Digits in Natural Numbers,” American Journal of Mathematics 4, 39-40.

7)       Nigrini, Mark J., and Mittermaier, Linda J. (1997) “The Use of Benford’s Law as an Aid in Analytical Procedures,” Auditing: A Journal of Practice and Theory 16, No. 2, Fall, 52-67.

8)       Raimi, Ralph (1976) “The First Digit Problem,” American Mathematical Monthly 83, 521-538.

9)       Salsburg, David (1997) “Digit Preferences in the Bible,” Chance 10, No. 4, 46-48.

10)    Weaver, Warren (1963) Lady Luck:  The Theory of Probability, Doubleday, Anchor Series, New York, 270-277.