But even the small counties have populations in the several hundreds or thousand...

tmalsburg2 · on May 8, 2017

(See edit below.)

Intuitively what you say makes sense. However, I just ran a simulation in R based on census data (actual county sizes from [1]) and the result is that you do get by-county age differences of up to ~17 years just by chance. These bigger differences do not occur often but the assumptions I made in the simulation are fairly conservative and the true random variation could easily be bigger. So 20 years does not seem absurd.

[1] https://factfinder.census.gov/faces/tableservices/jsf/pages/...

In the spirit of open science, the code for the simulation:

  library(tidyverse)

  read_csv("co-est2016-alldata.csv") %>%
    filter(SUMLEV=="050") %>%
    select(county=CTYNAME, pop=POPESTIMATE2016) %>%
    group_by(county) %>%
    mutate(age = mean(rbinom(pop, 100, 0.7))) ->
    d

  range(d$age)

  ggplot(d, aes(age)) + geom_histogram()

Running the simulation repeatedly yields maximal differences between counties between 16 and 19 years.

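EDIT: False alarm. I incorrectly assumed that the census data set uses unique names for counties, and that introduced a bug in the simulation: county names repeat across states, so group_by(county) merged same-named counties and rbinom received a vector as its first argument, in which case it draws only as many samples as the vector has elements. After correcting it, the simulation shows that random variation explains only very small differences in life expectancy, about one year. Corrected simulation code below: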

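  library(tidyverse)

  read_csv("co-est2016-alldata.csv") %>%
    filter(SUMLEV=="050") %>%              # keep county-level rows only
    mutate(county_id = row_number()) %>%   # unique id, since county names repeat
    select(county_id, county=CTYNAME, pop=POPESTIMATE2016) %>%
    group_by(county_id) %>%
    # one Binomial(100, 0.7) age at death per resident, averaged per county
    mutate(age = mean(rbinom(pop[1], 100, 0.7))) %>%
    ungroup() -> d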

  range(d$age)

Life expectancy as a function of county population: http://imgur.com/a/0LgGZ
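As a rough cross-check on the corrected simulation, the spread of a county's mean age can also be worked out directly from the binomial model used above. This is only a sketch: the population sizes are illustrative round numbers, not census values, and it assumes residents' ages at death are independent.

```r
# Age at death modelled as Binomial(100, 0.7), as in the simulation above
sd_age <- sqrt(100 * 0.7 * 0.3)   # per-person standard deviation in years

# Standard error of a county's mean age for a few illustrative population sizes
pop <- c(100, 1000, 10000, 100000)
se_mean <- sd_age / sqrt(pop)

print(data.frame(pop, se_mean = round(se_mean, 3)))
```

Under these assumptions the standard error shrinks with the square root of the population, so only the very smallest counties should stray noticeably from 70.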

gerdesj · on May 8, 2017

I think a big problem (and you highlight an outcome of it here) is that geopolitical units like counties, which vary hugely in both area and population, are always going to be a troublesome background on which to pin statistics.

One idea might be to distort the map so each county has a size proportional to its population - quite tricky around metropolitan areas! There might be a way to aggregate adjacent counties in some way.

Depending on the raw data granularity, it might be best to dispense with political boundaries and, say, plot based on some form of population-to-area measure that smears things somewhat. That would probably be fair but won't please anyone with an axe to grind (i.e. everyone).

You should see the statistical contortions carried out here (UK) for similar bollocks. North/South divide? - where is the middle of the UK? Who knows? Does it include Scotland, Northern Ireland and Wales? Where exactly are the Midlands? Is Wiltshire in the South West or the South? Anyway you get the idea.

Unless these kinds of maps are very, very carefully and rigorously put together, you end up spending more time explaining outliers and oddities than focussing on the real issues (whatever they are). A fair and rigorous map will probably please no-one 8)

Oh, and another thought: if your post (sorry) zip codes are involved in the raw data then, as a previous article on HN showed, they don't always map very well to the boundaries they purport to cover.

kevindqc · on May 9, 2017

How ZIP codes nearly masked the lead problem in Flint (2016)

https://news.ycombinator.com/item?id=14237184

Retric · on May 8, 2017

You should also use an actuarial table, because deaths become much more common with age, shrinking the deviation significantly. However, even a one-year variation shows there is something significant going on.

MR4D · on May 8, 2017

Thanks for sharing!

It would be interesting to know the standard deviation for counties with fewer than, say, 10,000 people, as well as for counties with more than 10,000 people.
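One way this comparison could be set up, sketched with made-up populations rather than the census file (the `pop` vector and the 200/200 split are purely hypothetical), using the same Binomial(100, 0.7) model as the simulation in this thread:

```r
set.seed(1)

# Hypothetical county populations (NOT census data): 200 below and 200 above 10,000
pop <- c(round(runif(200, 100, 9999)), round(runif(200, 10000, 500000)))

# The mean of `pop` draws from Binomial(100, 0.7) has the same distribution as
# a single Binomial(100 * pop, 0.7) draw divided by pop, which is much cheaper
age <- rbinom(length(pop), 100 * pop, 0.7) / pop

# Standard deviation of the simulated county means within each size class
size_class <- ifelse(pop < 10000, "under 10,000", "10,000 or more")
sd_by_class <- tapply(age, size_class, sd)
print(sd_by_class)
```

With the real data, `pop` would come from the POPESTIMATE2016 column instead of `runif`.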

tmalsburg2 · on May 8, 2017

See the plot that I linked in the post above.

flexie · on May 9, 2017

Impressive!

generj · on May 8, 2017

You'd think that, but the US has about 724 deaths per 100,000. In a county with fewer than 10,000 people, that leaves at most about 72 deaths per year. Loving County in Texas has 112 residents. "35 counties have a population under 1,000; 307 counties have a population under 5,000; 709 counties have a population under 10,000" [0].

I haven't read the paper, so I don't know their exact methodology, but if they just took a snapshot in 1980 and then in 2014, a county of 10,000 would only be looking at about 145 deaths total. That's small enough that it's quite possible for outliers to have a big impact. In a county of 1,000 there are only 7 deaths a year, so if you only look at two years that's a total of 14 deaths. It doesn't take a lot to jostle the numbers when you don't have many observations.

In any case, this isn't the main contributor here, as pointed out elsewhere. But it is a good thing to be aware of whenever county level statistics are provided.

[0] https://en.wikipedia.org/wiki/County_(United_States)#Populat...
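The arithmetic above can be sketched in a few lines of R. The Poisson step is an added assumption, not something taken from the paper: if death counts are roughly Poisson, the relative noise in a count is about one over the square root of the expected number of deaths.

```r
death_rate <- 724 / 100000        # crude US death rate quoted above
pop   <- c(1000, 10000, 100000)   # illustrative county sizes
years <- 2                        # two snapshot years, as in the example above

expected_deaths <- death_rate * pop * years

# Assumption: counts are roughly Poisson, so SD = sqrt(mean)
relative_sd <- 1 / sqrt(expected_deaths)

print(data.frame(pop, expected_deaths, relative_sd = round(relative_sd, 2)))
```

On these assumptions, a county of 1,000 observed over two years has a death count that is uncertain by roughly a quarter, which is the "jostling" described above.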

itissid · on May 8, 2017

It seems they are being smart about using counties: "All analyses were carried out at the county level. Counties were combined as needed to create stable units of analysis over the period 1980 to 2014, reducing the number of areas analyzed from 3142 to 3110 (eTable 1 in the Supplement). For simplicity, these units are referred to as “counties” throughout."

They also use https://en.wikipedia.org/wiki/Small_area_estimation which seems like a way to deal with counties that have small populations. And from what I can tell, they are using hierarchical models, which should have some regularizing effect from larger populations.
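To illustrate what such a regularizing effect can look like, here is a toy partial-pooling sketch. It is not the paper's model: the normal-normal setup, `tau`, `grand`, and the county sizes are all assumptions chosen for illustration, and the person-level SD is borrowed from the Binomial(100, 0.7) simulation earlier in the thread.

```r
set.seed(1)

# Hypothetical setup (not the paper's data or model)
pop       <- c(112, 1000, 10000, 100000)  # illustrative county sizes
sd_person <- sqrt(100 * 0.7 * 0.3)        # person-level SD from the thread's simulation
tau       <- 1                            # assumed SD of true county means, in years
grand     <- 70                           # assumed overall mean

# Simulate one noisy observed mean per county
true_mean <- rnorm(length(pop), grand, tau)
observed  <- rnorm(length(pop), true_mean, sd_person / sqrt(pop))

# Partial pooling: the weight on a county's own data grows with its population
weight <- tau^2 / (tau^2 + sd_person^2 / pop)
shrunk <- weight * observed + (1 - weight) * grand

print(data.frame(pop, observed, weight = round(weight, 3), shrunk))
```

Small counties get pulled furthest toward the overall mean, while large counties are left almost untouched.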