Benford’s Law describes the frequency distribution of first digits in data (e.g., the first digit of each of the numbers 3, 37, 311 and 3500 is 3). While you might expect the digits 1 to 9 to appear with approximately equal frequency in the first position, that is often not the case: in many data sets the first digit is 1 about 30% of the time and 2 about 18% of the time, with the frequency decreasing steadily down to 9, which appears about 5% of the time. As summarized by Wikipedia, this applies to electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, and so on. The reasons for this phenomenon are not easily explained, although the Wikipedia entry gives an account that you may be able to follow.
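The percentages above come from the standard Benford formula, in which the expected share of first digit d is log10(1 + 1/d). A minimal sketch in Python follows; the function name is just an illustrative choice.

```python
import math

def benford_expected():
    """Expected share of each first digit 1-9 under Benford's Law."""
    # P(d) = log10(1 + 1/d); the nine shares sum to 1.
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, share in benford_expected().items():
    print(f"first digit {digit}: {share:.1%}")
```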
I came across a very interesting blog post on using Benford’s Law to check survey quality. At HealthBridge we carry out many surveys, and ensuring the quality of the data is an ongoing concern. How can we separate the good-quality data from the bad? There are standard checks for data quality, such as ensuring that reported ages are in the expected range, or checking for consistency between questions (e.g., only females should be pregnant, only children should be in primary school). With dietary data, with which I most often work, there are no clear-cut checks. We expect dietary energy intakes to be approximately equal to requirements, but there is no absolute rule. Individuals should not normally have an energy intake of 0 kcal, but they may when they are sick. They would not normally consume 10,000 kcal in a day, but they may on a party day. However, when the adult men in a sample have average intakes of 1700 kcal, you can be pretty sure that there is under-reporting somewhere in the dataset. But how can it be separated out from the good data?
When I heard about Benford’s Law I became hopeful that, finally, here was a tool that could separate the good dietary data from the bad.
Benford’s Law most often applies to distributions that spread across many orders of magnitude. In dietary surveys the most likely candidate is vitamin A intake, measured in International Units, which ranges from 0 to tens of thousands and even over 100,000. So I looked at the distribution of vitamin A in three dietary surveys I have been a part of in the past few years. For different reasons, I felt that one of the surveys was done a little better than the others (“best”) and one was poorly done (“worst”).
The results are shown in the following figure. Indeed, the survey I felt was best came closest to having 30% of the first digits as 1, but the others were not that far off. Do they actually deviate from Benford’s Law in a meaningful way?
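One conventional way to put a number on that question is a chi-square goodness-of-fit test against the Benford shares. This is not something I ran on the surveys; the sketch below uses made-up counts and a hypothetical function name, and it leaves zeros out because Benford’s Law only covers digits 1 to 9. Bear in mind that with large samples a chi-square test will flag even small deviations.

```python
import math

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_DF8_05 = 15.507

def benford_chi_square(counts):
    """Chi-square statistic for observed first-digit counts (digits 1-9)."""
    n = sum(counts.get(d, 0) for d in range(1, 10))
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford count for digit d
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Made-up counts for illustration only; these are not survey results.
observed = {1: 290, 2: 180, 3: 130, 4: 100, 5: 80, 6: 70, 7: 60, 8: 50, 9: 40}
stat = benford_chi_square(observed)
print(f"chi-square = {stat:.2f}")
print("deviates at the 5% level:", stat > CHI2_CRITICAL_DF8_05)
```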
(Note the small frequency of 0. Occasionally vitamin A intake will be zero. I have not read whether distributions that include 0 follow Benford’s Law, but the zeros seem minimally disruptive in these cases.)
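For anyone wanting to repeat the tally, here is a rough sketch of how first digits could be extracted, with exact zeros counted in their own category as in the figure. The function names and the example values are invented for illustration.

```python
from collections import Counter

def first_digit(value):
    """First non-zero digit of a number; 0 only when the value is exactly zero."""
    # Scientific notation puts the first digit up front, e.g. 3500 -> '3.5...e+03'.
    return int(f"{abs(value):.16e}"[0])

def first_digit_shares(values):
    """Share of values starting with each digit 0-9 (0 means an exact zero)."""
    counts = Counter(first_digit(v) for v in values)
    n = len(values)
    return {d: counts.get(d, 0) / n for d in range(10)}

# Made-up intakes for illustration only; these are not survey data.
example = [3, 37, 311, 3500, 0, 125.5, 0.08]
print(first_digit_shares(example))
```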
In the survey that was the “worst” case above, there were a few interviewers who seemed to be worse than the others, with their subjects frequently having unrealistically low energy intakes. So, for each of the 25 interviewers, I plotted the percent of first digits=1 (where the “right” answer is 30%) versus the average energy intake in adult women (where the “right” answer is harder to define but should probably be over 2000 kcal). While the interviewers with the highest intakes did have close to 30% with first digit=1, so did some of those with lower intakes. So not really definitive…
I looked at the data in a few different ways. One way was to plot the ratio of first digit = 1 to first digit = 9 (where the “right answer” is about 6.6). Again, the interviewers with the two highest (probably more accurate) energy intakes had ratios close to that value, but so did some of the interviewers with lower (probably less accurate) energy intakes.
(Four interviewers had 0% with first digit=9, so the ratio for them is infinite, but they are plotted on this graph at 18 as placeholders.)
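The two per-interviewer measures above could be computed along these lines. This is only a sketch: the record layout (pairs of interviewer ID and intake) and the function names are assumptions, and an interviewer with no first digit of 9 gets an infinite ratio here instead of a placeholder value.

```python
import math
from collections import defaultdict

def leading_digit(value):
    """First non-zero digit of a number; 0 only when the value is exactly zero."""
    return int(f"{abs(value):.16e}"[0])

def interviewer_summary(records):
    """records: iterable of (interviewer_id, intake) pairs (assumed layout)."""
    digits = defaultdict(list)
    for interviewer, intake in records:
        digits[interviewer].append(leading_digit(intake))
    summary = {}
    for interviewer, ds in digits.items():
        ones, nines = ds.count(1), ds.count(9)
        if nines:
            ratio = ones / nines
        else:
            # No 9s: infinite if there are any 1s, undefined otherwise.
            ratio = math.inf if ones else math.nan
        summary[interviewer] = {"share_1": ones / len(ds), "ratio_1_to_9": ratio}
    return summary

# Made-up records for illustration only; these are not survey data.
records = [("A", 120), ("A", 95), ("A", 15), ("B", 130), ("B", 250)]
print(interviewer_summary(records))
```

Under Benford’s Law the expected ratio is log10(2) / log10(10/9), which is roughly 6.6.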
Nutrient intakes are calculated as the reported weight of the food consumed multiplied by the amount of nutrient estimated to be in that food. Perhaps it would be more revealing to do the analyses with just the weight of the foods (which is what the interviewer actually recorded and may be a better indicator of her ability)… but no, I got the same sort of results (not shown here).
I feel I am close to something useful in applying Benford’s Law to dietary data, but I am not there yet. Maybe analysis of more nutrients from more data sets would be revealing, but it would take quite an effort. Any ideas, anybody?