One possible partial explanation for this is the same reason why the Bill Gates ...

Nacraile · on May 8, 2017

Sample-size-based variance is a nice nit to pick, but it doesn't really seem to match the data presented in the article. Outcomes are strongly correlated in adjacent counties across the map, which is not predicted by sample-size (which would produce uncorrelated noise, in inverse proportion to population density). On the other hand, there is substantial correlation between the given geographic distribution of life expectancy changes and geographic distribution of wealth, as observed in the article.

generj · on May 8, 2017

Yes, this is correct. Especially with how extreme the values are within the clusters (like in Kentucky), there is almost no chance they occurred by sample-size variance.

Sample-size variance might explain some of those little counties with blips out in OK, MS, and TN though. As an aside, the color scheme on the map makes any decrease look substantially more massive.

I wish we had a map which was recentered on the national average growth of life expectancy as well.

PeterisP · on May 8, 2017

The largest life expectancy increase is in New York County, which has a population of 1.6 million, comparable to whole smallish countries. So no, this is not explained by high variance in small counties.

flexie · on May 8, 2017

But even the small counties have populations in the hundreds or thousands, right? An outlier who dies at 33 or 103 barely moves the average when the population is 2,000.

tmalsburg2 · on May 8, 2017

(See edit below.)

Intuitively what you say makes sense. However, I just ran a simulation in R based on census data (actual county sizes from [1]) and the result is that you do get by-county age differences of up to ~17 years just by chance. These bigger differences do not occur often but the assumptions I made in the simulation are fairly conservative and the true random variation could easily be bigger. So 20 years does not seem absurd.

[1] https://factfinder.census.gov/faces/tableservices/jsf/pages/...

In the spirit of open science, the code for the simulation:

  library(tidyverse)

  read_csv("co-est2016-alldata.csv") %>%
    filter(SUMLEV=="050") %>%
    select(county=CTYNAME, pop=POPESTIMATE2016) %>%
    group_by(county) %>%
    mutate(age = mean(rbinom(pop, 100, 0.7))) ->
    d

  range(d$age)

  ggplot(d, aes(age)) + geom_histogram()

Running the simulation repeatedly yields maximal differences between counties between 16 and 19 years.

EDIT: False alarm. I incorrectly assumed that the census data set uses unique identifiers for counties, and that introduced a bug in the simulation. After correcting it, the simulation shows that random variation indeed explains only very small differences in life expectancy, of about one year. Corrected simulation code below:

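  library(tidyverse)

  read_csv("co-est2016-alldata.csv") %>%
    filter(SUMLEV=="050") %>%                # keep county-level rows only
    mutate(county_id = 1:n()) %>%            # unique id per row; county names repeat
    select(county_id, county=CTYNAME, pop=POPESTIMATE2016) %>%
    group_by(county_id) %>%
    # one simulated age at death per resident, averaged per county
    mutate(age = mean(rbinom(pop[1], 100, 0.7))) %>%
    ungroup -> d

  range(d$age)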

Life expectancy as a function of county population: http://imgur.com/a/0LgGZ
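
For a rough analytic cross-check of the corrected simulation, here is a small sketch (in Python rather than the R used above; the function name is made up for illustration). It assumes the same toy model as the simulation, where each age at death is an independent Binomial(100, 0.7) draw. A single draw then has variance 100 × 0.7 × 0.3 = 21, so the mean over a county of n people has a standard error of sqrt(21 / n).

```python
import math

# Assumption carried over from the simulation above: each age at death
# is an independent Binomial(100, 0.7) draw. This is a toy model, not an
# actuarial table.
TRIALS = 100
P = 0.7


def mean_age_standard_error(pop):
    """Standard error of a county's mean age at death under the toy model."""
    variance_single = TRIALS * P * (1 - P)  # 21 for Binomial(100, 0.7)
    return math.sqrt(variance_single / pop)


# County sizes mentioned in this thread: Loving County, a small county,
# the 10,000 threshold, and New York County.
for pop in (112, 2_000, 10_000, 1_600_000):
    se = mean_age_standard_error(pop)
    print(f"pop={pop:>9,}  standard error = {se:.3f} years")
```

Under these assumptions even a county of 112 people has a standard error below half a year, which is in line with the corrected simulation showing differences of only about one year.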

gerdesj · on May 8, 2017

I think a big problem (and you highlight an outcome here) is that geopolitical units like counties, which vary hugely in both area and population, are always going to be a troublesome background on which to pin statistics.

One idea might be to distort the map so each county has a size proportional to its population - quite tricky around metropolitan areas! There might be a way to aggregate adjacent counties in some way.

Depending on the raw data granularity it might be best to dispense with political boundaries and, say, plot based on some form of population-to-area measure that smears things somewhat. That would probably be fair but won't please anyone with an axe to grind (i.e. everyone).

You should see the statistical contortions carried out here (UK) for similar bollocks. North/South divide? - where is the middle of the UK? Who knows? Does it include Scotland, Northern Ireland and Wales? Where exactly are the Midlands? Is Wiltshire in the South West or the South? Anyway you get the idea.

You end up spending more time explaining outliers and oddities than you do focussing on the real issues (whatever they are) with these kinds of maps, unless they are very, very carefully and rigorously put together. A fair and rigorous map will probably please no-one 8)

Oh and another thought - if your post (sorry) zip codes are involved in the raw data then as a previous article on HN showed they don't always map very well to the boundaries they purport to cover.

kevindqc · on May 9, 2017

How ZIP codes nearly masked the lead problem in Flint (2016)

https://news.ycombinator.com/item?id=14237184

Retric · on May 8, 2017

You should also use an actuarial table, because deaths become much more common with age, which shrinks the deviation significantly. However, even a one-year variation shows there is something significant going on.

MR4D · on May 8, 2017

Thanks for sharing!

It would be interesting to know the standard deviation for counties with fewer than, say, 10,000 people, as well as for counties larger than 10,000 people.

tmalsburg2 · on May 8, 2017

See the plot that I linked in the post above.

flexie · on May 9, 2017

Impressive!

generj · on May 8, 2017

You'd think that, but the US has about 724 deaths per 100,000. In a county with fewer than 10,000 people that leaves only about 72 deaths per year. Loving County in Texas has 112 residents. "35 counties have a population under 1,000; 307 counties have a population under 5,000; 709 counties have a population under 10,000" [0].

I haven't read the paper, so I don't know their exact methodology but if they just took a snapshot in 1980 and then in 2014 a county of 10,000 would only be looking at about 145 deaths total. That's small enough that it's pretty possible for outliers to have a big impact. In a county of 1,000 there are only 7 deaths a year, so if you only look at two years that's a total of 14 deaths. It doesn't take a lot to jostle the numbers when you don't have many observations.

In any case, this isn't the main contributor here, as pointed out elsewhere. But it is a good thing to be aware of whenever county level statistics are provided.

[0]https://en.wikipedia.org/wiki/County_(United_States)#Populat...
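
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function name is hypothetical, and it assumes every county matches the national crude rate of 724 deaths per 100,000 quoted in the comment, which ignores differences in age structure between counties.

```python
# Crude death rate quoted in the comment above (deaths per 100,000 per year).
DEATHS_PER_100K = 724


def expected_deaths_per_year(pop, rate_per_100k=DEATHS_PER_100K):
    """Expected annual deaths if a county matched the national crude rate."""
    return pop * rate_per_100k / 100_000


# Loving County (112 residents) and the population thresholds from [0].
for pop in (112, 1_000, 5_000, 10_000):
    deaths = expected_deaths_per_year(pop)
    print(f"pop={pop:>6,}  expected deaths/year = {deaths:.1f}")
```

Doubling any of these figures gives the two-snapshot totals mentioned above, for example roughly 145 deaths for a county of 10,000.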

itissid · on May 8, 2017

It seems they are being smart about using counties: "All analyses were carried out at the county level. Counties were combined as needed to create stable units of analysis over the period 1980 to 2014, reducing the number of areas analyzed from 3142 to 3110 (eTable 1 in the Supplement). For simplicity, these units are referred to as “counties” throughout."

They also use https://en.wikipedia.org/wiki/Small_area_estimation which seems like a way to deal with the counties with small sizes. And from what I can tell they are using hierarchical models that should have some regularizing effect from larger populations.
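
To give a feel for the regularizing effect, here is a minimal partial-pooling sketch in Python. It is not the paper's model: the function, the `prior_strength` parameter, and the example numbers are all made up for illustration. The idea is that a county's raw mean is blended with a pooled mean, and the fewer observations a county has, the more weight the pooled mean gets.

```python
def shrunken_estimate(county_mean, county_n, pooled_mean, prior_strength):
    """Blend a county's own mean with a pooled mean.

    prior_strength is a hypothetical tuning parameter: the number of
    observations at which the county's own data and the pooled mean get
    equal weight. It is an assumption, not a value from the paper.
    """
    weight = county_n / (county_n + prior_strength)
    return weight * county_mean + (1 - weight) * pooled_mean


# Illustrative numbers only: a raw county mean of 70 against a pooled mean of 78.
small = shrunken_estimate(70.0, county_n=14, pooled_mean=78.0, prior_strength=100)
large = shrunken_estimate(70.0, county_n=10_000, pooled_mean=78.0, prior_strength=100)
print(f"14 observations:     {small:.2f}")
print(f"10,000 observations: {large:.2f}")
```

With only 14 observations the estimate is pulled most of the way toward the pooled mean, while with 10,000 observations it stays close to the county's own value.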

Perceval · on May 9, 2017

Here's a post on how to address trying to map the abnormal while controlling for variance across sample sets: https://medium.com/@uwdata/surprise-maps-showing-the-unexpec...

aaron695 · on May 9, 2017

Failure != Wasted

misiti3780 · on May 8, 2017

I was ready to add that same comment.