Tuesday, April 7, 2015

Data adjustments - not a conspiracy, just a part of empirical work in economics

I got an email today announcing an Urban seminar, and the abstract reminded me of some of the Piketty debates around Bob Murphy and Phil Magness's paper and subsequent discussions. Here it is:
"ABSTRACT: The 2014 Current Population Survey, Annual Social and Economic Supplement (CPS-ASEC) introduced major changes to the income questions. The questions were introduced in a split-sample design—with 3/8 of the sample asked the new questions and 5/8 asked the traditional questions. Census Bureau analysis of the 3/8 and 5/8 samples finds large increases in retirement, disability, and asset income and modest increases in Social Security and public assistance benefits under the new questions. However, despite the additional income, poverty rates are higher for children and the elderly in the sample asked the new questions. In this brownbag, we discuss the changes to the survey, the effects of the changes on retirement and other income, and describe how compositional differences among families with children in the 3/8 and 5/8 samples may explain the unexpectedly higher poverty rates in the 3/8 sample. The discussion has practical as well as theoretical importance, as researchers will have a choice of datasets to choose from when analyzing the 2014 CPS-ASEC data—the 3/8 sample weighted to national totals, the 5/8 sample weighted to national totals, a combined sample, and possibly also an additional file prepared by the Census Bureau that imputes certain income data to the 5/8 sample based on responses in the 3/8 sample."

The CPS is typically not used to address inequality for all sorts of reasons, including the nature of the questions, coverage, and top-coding. But it still has income questions, and note that the recent redesign changes asset income reports. Of course if we were to use the CPS to think about some of Piketty's research questions, this change would be important. Moreover, if you wanted to use a consistent series from the CPS you would have to adjust the data, either by moving down the newer half of the series or (probably preferably, if this redesign represents an improvement) by moving up the older half of the series. The Census Bureau fields the split samples discussed in the abstract so that you can understand the sort of adjustment that might be appropriate.

This is what Piketty is doing too when he harmonizes several of the wealth inequality series, and he uses years when the data series overlap to develop the adjustment factors. The figure Murphy and Magness like to call the "Frankenstein graph" suggests that certain blocks of the series come from different datasets, but in reality Piketty is typically taking data from several datasets to provide a harmonized estimate (for example, combining the Kopczuk and Saez data and the SCF data). This is how you'd want to merge several datasets, and it's generally not "pivoting" between datasets or "overstating" them as Murphy and Magness put it.
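To make the idea of an overlap-based adjustment concrete, here is a minimal sketch in Python. It is not Piketty's actual procedure or the Census Bureau's: it assumes a simple multiplicative adjustment (the average ratio of the two series in the years they share), and the series values, `splice_factor`, and `harmonize` are all hypothetical, invented purely for illustration.

```python
from statistics import mean


def splice_factor(old_series, new_series):
    """Average ratio new/old over the years both series cover.

    Assumption: a single multiplicative factor is adequate. Real
    harmonizations may use additive or year-specific adjustments.
    """
    overlap = sorted(set(old_series) & set(new_series))
    if not overlap:
        raise ValueError("series share no years, so no factor can be estimated")
    return mean(new_series[year] / old_series[year] for year in overlap)


def harmonize(old_series, new_series):
    """Rescale the old series to the new series' level, then combine.

    Where both series report a year, the new series' value is kept.
    """
    factor = splice_factor(old_series, new_series)
    combined = {year: value * factor for year, value in old_series.items()}
    combined.update(new_series)
    return dict(sorted(combined.items()))


# Hypothetical top-share figures (percent). These are NOT real data.
old = {1970: 20.0, 1980: 22.0, 1990: 24.0}
new = {1990: 30.0, 2000: 31.0, 2010: 33.0}

harmonized = harmonize(old, new)
for year, share in harmonized.items():
    print(year, round(share, 2))
```

The point of the sketch is only that the overlap year (1990 here) is what pins down the adjustment. With more overlap years the factor would be averaged across them, and you could reasonably argue about whether to move the old series up or the new series down.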

Anyone can criticize these sorts of data decisions, but they're a normal part of empirical work. If your criticism is just that the data decisions result in the conclusion that Piketty draws, that's not a very reasonable criticism. It's entirely circular: Piketty's conclusions are bad because his data decisions are bad. How do you know his data decisions are bad? Because they correspond to his conclusions!


  1. You do yourself a real disservice, Daniel, when you make arguments such as these.

    M&M are fully aware that he's harmonizing different series. YOU just seem unwilling to admit that they've dug deeper into this one than you're comfortable discussing yourself.

    What M&M point out and what Piketty's defenders ignore is that he does more than just harmonize data sets through "standard" stats practices. When the Kopczuk and Saez data starts to break from Piketty's story he just deletes it and goes to the SCF. And when the SCF doesn't support him either he starts picking and choosing *which* SCF years he wants to include. That's not standard social science. That's data malpractice which he doesn't explain either in his footnotes.

    1. If you take a look at the other posts I've done, I have dug deeper and they really haven't dug all that deep. And their reaction does pretty much amount to this, which was why I shared the abstract.

      re: "When the Kopczuk and Saez data starts to break from Piketty's story he just deletes it and goes to the SCF."

      No, this is M&M's account. People have discussed at length why SCF data is better. M&M don't engage that conversation.

      re: "And when the SCF doesn't support him either he starts picking and choosing *which* SCF years he wants to include."

      You think his choice between Wolff and Kennickell is because he likes the Kennickell outcomes better? Do you have a reason for thinking this other than that it's M&M's assertion?

      Let me guess - you've read M&M but you haven't bothered to actually at least glance through Wolff and Kennickell?

      I don't mean to be a jerk or anything, but when you come in here talking about what I'm unwilling to do and then you just repeat stuff like this I don't think I'm obligated to just lie down and take it.

      Take a look at Wolff and Kennickell, and then all I ask is that you at least try to articulate why I think M&M's "picking and choosing which SCF years" account is a bad one.

    2. "No, this is M&M's account. People have discussed at length why SCF data is better. M&M don't engage that conversation."

      What you're saying about the SCF is very inconsistent with what Alan Auerbach says about the SCF. And M&M seem to be in complete agreement with Auerbach.

      "According to the annotations in Piketty’s dataset, these values appear to have been
      averaged in a way that seems difficult to reconcile with the way in which they are presented in
      some of the key charts in the book. For instance, an annotation in the dataset implies that the
      1980 value is the average of the 1983 value in Wolff (1994) and the 1989 value from Kennickell
      (2009). Likewise, the annotation for the 1990 value implies that it is the average of the 1992,
      1995, and 1999 values from Kennickell (2001). These averaging schemes yield, respectively,
      average years of observation of 1986 and 1995. A representation of the values generated by this
      averaging scheme as point-in-time values for 1980 and 1990 thus seems highly-stylized.
      Though the disparity between the actual year of observation and the depicted year of
      observation is smaller than for 1980 and 1990, the same phenomenon also occurs in the most
      recent data shown by Piketty. His spreadsheet indicates that the average of observations from
      the years 2001 and 2004 form the observation for the year 2000 in his book. A single value from
      the year 2007 serves as the most recent 2010 observation. The latter choice creates the
      impression that the chart includes data after the recent financial crisis, but it does not."

    3. You need to be more specific on Auerbach. I may very well disagree with him. I thought the AEA presentation was very good, and I glanced through the paper but have not given it a close read.

      I agree completely with your quote through "thus seems highly stylized". After that I'd have to go back to Piketty's data.

    4. OK, so I pulled up Piketty's tables and Kennickell (2011) to double check my memory of all this. So in case you don't get to Wolff, the reason Piketty switched from Wolff to Kennickell is that Wolff only has a forecast of inequality for 2009, he does not have the data. The whole point of Kennickell (2011) is that he's updating things with the second wave of the 2007 SCF panel, so he does have data on 2009. He finds there is no significant change in the wealth share. Of course, Piketty notes this.

      Now, on the question of why there's no actual 2009 number in the mix? I suspect he did it because the Kennickell number comes from a 2007-2009 panel and is not a new 2009 sample. In any case we know that, in fact, the 2010 figure does account for changes after the crisis because Piketty is relying on Kennickell's finding that there was no distributional shift from 2007 to 2009 using the panel.

      You might think he should just throw that 2009 panel number in there. I'd probably be inclined to agree with that. But you really don't have grounds to call this malpractice.

  2. I've been trying to figure out why the Wolff and Kennickell pictures of the 90s differ, but I can't find anything obvious, and they don't cite each other so it's hard to compare (not surprising since they came out around the same time). It might be vehicles - I think Wolff excludes them and Kennickell includes them - but I'm not 100% sure.

    In any case none of this really changes the fact that the SCF shows inequality increasing.

    1. What kind of an increase are we talking about? Wolfe's Top 1% is 33.8% in 1983 & 34.6% in 2006 (last year before his 2009 # that's a forecast). Are you really saying that a diff. of 0.8% proves Pikkety?

    2. It's Piketty and Wolff, not Pikkety and Wolfe.

      Let's please unpack "proving Piketty" for a second.

      So Piketty shows only about a 1 percentage point increase for the top 1% from his synthetic 1990 point to his 2010 point (somewhat bigger for 1980 to 2010, but not real big). That's certainly in line with the sort of very modest increase Wolff shows for this period. It ought to be, since Wolff and Kennickell are using the same data!

      The steeper increase is in the top 10% (mainly attributable to the 95th to 99th percentile). Wolff shows this. Kennickell shows this. Again, it would be weird if they didn't. For what it's worth S&Z show this too, although K&S don't.

      So what part of Piketty is supposed to be inconsistent with the broad evidence from the SCF?

      The biggest open question for me is why the 90s growth for the 1% in Wolff is so much steeper than for Kennickell. I don't know the answer to that question so I don't have a dog in that fight. In any case since that drops around 2000 it all nets out to be about the same change from the SCF. Wolff ≈ Kennickell ≈ Piketty on that.

  3. Kennickell shows only 1 clear increase, from 30.2% to 34.6% between 1992 and 1995 (the Clinton Years).

    But then he has 34.6% in 1995 and 34.5% in 2010. Where's his increase, cause that's actually a slight decrease?

    Neither of these supports: "In any case none of this really changes the fact that the SCF shows inequality increasing."

    1. I guess I'm not really understanding your concern. Piketty's 1990 to 2010 increase for 1% is 0.9 percentage points, right? A little bigger for 1980 to 2010. These seem broadly in line with the Kennickell numbers and the Wolff numbers. Do you disagree?

      The big increase in inequality comes relative to the 1970s trough - the growth he shows is much steadier/slower once we get into the 90s. Do you disagree?

      The real story here is the K-S 1970s trough which cannot be confirmed in the SCF because there is no SCF for those years. That whole K&S vs. S&Z vs. SCF question is the key, not Wolff vs. Kennickell.

    2. Wolff 1982 to 2006 is +0.8. Piketty 1980 (30.1%) to 2010 (33.8%) is +3.7. Over 4.5X what his source shows, no?

      Kennickell doesn't go back before 1989, correct? So we can't really compare what he did in 1980 to 2010, only 1990 to 2010. And Kennickell's increase is almost ALL 1992 to 1995, then 1995 to 2010 is actually -0.1, no?

      SO the real question is this - Where does Piketty get +3.7 for 1980 to 1990?

      Clearly not Wolff. While you could say 1992-1995 in Kennickell is an increase, it's only between 2 years of SCFs, not the entire 30 years P. stretches it out to. Piketty's results don't match either source. M&M + Auerbach all seem to have a point that this looks like data malpractice.

    3. I mean Piketty is +3.7 for 1980-2010.

  4. Piketty's 1980 to 2010 change comes from Kennickell, 1989 to 2007. He says in the note that he grabs 1983 from Wolff but he doesn't appear to, but I'm just glancing over all of this. What is not understood about where Piketty gets it from? What do you mean they don't match either source? They look right to me.

    It's not malpractice.

    Stop saying that.

    If you have a disagreement with Piketty to offer, make it. If all you have is that you don't know what Piketty's source is you shouldn't accuse him of malpractice until you do know what Piketty's source is. I have Wolff, Kennickell, and Piketty on my screen right now and I see where his numbers are coming from. I don't know what your confusion on this is.

    What I don't know is what is the source of the difference between Kennickell and Wolff, but the fact that two researchers came up with somewhat different answers to potentially different questions (are they measuring net worth the same? I don't know) does not mean Piketty is guilty of malpractice.


