Wednesday, April 17, 2013

Coding errors and the smell test

I just found a coding error in some stuff I did last night, which is not surprising because I was burning the midnight oil.

It was pretty obvious - some people had GPAs of over 20.

That's not right.

It turns out that in one line of code, instead of weighting the average grade in a given field in the first semester by the number of credits in that semester, I weighted the credits by the credits. So a few people who took a lot of classes in one field in their first semester of college (not many, but some) ended up with enormous GPAs.
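To make the slip concrete, here's roughly what the construction looks like in Stata. This is a stylized sketch, not my actual code, and the variable names (grade, credits, student, field) are invented for illustration:

    * hypothetical long-format data: one row per (student, field, course),
    * first semester only
    * intended: field GPA = sum(grade * credits) / sum(credits)
    bysort student field: egen quality_pts = total(grade * credits)
    bysort student field: egen total_creds = total(credits)
    gen gpa = quality_pts / total_creds

    * the bug amounted to putting credits where grade belongs in the
    * numerator, which blows up for credit-heavy first semesters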

I found the coding error quickly with some descriptive checks (always important), but mostly because I knew what results made sense and what didn't. I was looking to make sure my GPAs ranged between zero and four (they're standardized to that range). Now they do.
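The checks themselves are one-liners. Something like this sketch (again with made-up variable names) is all it takes to flag the problem:

    * quick sanity checks on the constructed variable
    summarize gpa, detail
    * inspect any offenders before the hard stop below
    list student field gpa if gpa > 4 & !missing(gpa)
    * fail loudly if anything falls outside the standardized range
    assert (gpa >= 0 & gpa <= 4) if !missing(gpa)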

Coding errors are easy to find if you can apply the smell test to your results. Multipliers of 5 do not make sense, even to the most ardent Keynesian. Negative multipliers ought to give the most ardent libertarian pause. This is why we do lots of studies on things like supply and demand elasticities - because they give us a feel for what makes sense or not.

And if a result doesn't make sense, something's wrong. Maybe it's a statistical artifact, but maybe you made a mistake.

What does this mean for Reinhart and Rogoff? I don't know. Their result isn't wildly implausible, particularly since in their own paper they always acknowledged the feedback between economic performance and debt. Maybe the big tipping-point result was a little less plausible. At the very least, the tipping point should be conditional on some other factor associated with creditworthiness or the monetary system (England during the Napoleonic Wars, the U.S. after WWII, Japan today, Belgium, etc. ought to give us that intuition).

The smell test probably isn't always as cut and dried as it was for me this morning, when I knew something was wrong, but the point still stands: good theory and a good sense of how the world works foster good empirics, which in turn contribute to a better understanding of how the world works. And perhaps most importantly, this iterative relationship between the quality of theory and the quality of empirics opens the door to multiple equilibria, including good and bad scientific equilibria.

6 comments:

  1. Do you write unit tests and get code reviews? Those are best practices in the software development industry, and IME they make a huge difference. I'm not sure whether they're feasible within an academic environment.

    Replies
    1. So from what I understand of those practices, I don't really do stuff that's generally amenable to unit tests. I mostly organize, manipulate, and recode data and then code the statistical analysis. So there's not really a module that would be unit tested (if I'm understanding the term right). Some people do write that sort of stuff (I'm using a Stata procedure in this paper that was produced by an economist and made available for download). I don't really do that, but perhaps I will someday.

      Simply in terms of code review, we did a lot of that at the Urban Institute, where research was done in bigger teams. I'm guessing that's less common in academia, but when you've got a lot of research assistants and cooperating professors it's more common.

      More informally, I've looked over another guy's code for this Sloan project.

      This particular case is a class paper so it's sort of all on me :)

    2. What you're doing isn't so amenable to unit testing, but I think it's still worth trying to do something along those lines.

      Here's the problem. Say you start out with a data set and perform a few transformations on it. You then record that intermediate set and perform a few more operations; finally you extract the statistics you need. Bugs often creep in that embed mistakes in the intermediate data set, and those can be very hard to catch. (Lots of problems with the HADCRU climate-change data and their program were caused by this, though whether they changed the overall prediction is up for debate.)
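      One cheap defense, in Daniel's Stata setting, is to pin down what must be true of the intermediate set before you build on it. A sketch, with hypothetical variable names and a made-up count:

        * after saving the intermediate set, assert its invariants
        save intermediate, replace
        isid student field          // key should uniquely identify rows
        assert !missing(gpa)        // no silently dropped grades
        count
        assert r(N) == 12345        // expected row count (hypothetical)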

      It's better to keep the original input data set, and perhaps several dummy variants of it, and then go through the whole transformation process several times. This helps shake out problems; I've done it before when I've had to do complex data manipulation. It requires automating the whole process, which I recommend doing anyway. That way you can easily apply it to several test input data sets.
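      In Stata that might look like a master do-file you can point at either the real extract or a toy one. Just a sketch; all the file names here are made up:

        * master.do -- rerun the full pipeline on whichever input you name
        *   usage:  do master.do real_extract.dta
        *           do master.do toy_extract.dta
        args infile
        use "`infile'", clear
        do clean          // recoding, merges, fixes to raw fields
        save intermediate, replace
        do transform      // build the analysis variables
        do checks         // asserts on ranges, keys, and counts
        do analyze        // the regressions themselves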

    3. Unit testing isn't really something that makes sense if you're running regressions in Stata, or if you're making a spreadsheet in Excel.

      (I'd argue that calling what Reinhart and Rogoff did a "coding error" makes it seem more excusable than it actually is. This wasn't like some unclosed parenthesis buried deep in an object's definition that slipped through their testing procedures. They just didn't include a bunch of data points when they calculated the most important summary statistic in the paper.)

      The general problem isn't about coding or entering the right commands when you're using a programme. It's that to err is human. We all do the same sort of stuff when we're solving a maths problem, for example. An elementary mistake early on in your solution (you drop a variable's sign, say) will show up some time later, when you realise that your answer is wrong. Why would you realise that your answer is wrong? You look to see if the answer makes sense, and then you check.

      If Reinhart and Rogoff had just checked to make sure that this statistic, which seemed to show a major jump or structural break at 90% debt/GDP levels, was actually right, they could have saved themselves a great deal of embarrassment.

    4. In other words, exactly what you said in your post, Daniel.

  2. Regarding practical measures to deal with this sort of stuff, I think you ideally want several people doing the same analysis independently and then pooling results. Then reproducibility is baked in from the outset. Since that's the ideal, there's presumably an efficient compromise that balances time or money constraints or whatever. I like the idea of using things like R's Sweave and Emacs' Org-babel as well.


All anonymous comments will be deleted. Consistent pseudonyms are fine.