Tuesday, February 26, 2008

The latest episode of Bikini Statistics

Nobody, or very very few, would notice that this model is completely made up. The reason is that, in real life, each of these x’s would have a name attached to it. If, for example, y was the amount spent on travel in a year, then some x’s might be x7=“married or not”, x21=“number of kids”, and so on. It is just too easy to concoct a reasonable story after the fact to say, “Of course, x7 should be in the model: after all, married people take vacations differently than do single people.” You might even then go on to publish a paper in the Journal of Hospitality Trends showing “statistically significant” relationships between being married and the amount spent on travel.

And you would be believed.

I wouldn’t believe you, however, until you showed me how your model performed on a set of new data, say from next year’s travel figures. But this is so rarely done that I have yet to run across an example of it. When was the last time anybody read an article in a sociological, psychological, etc., journal in which truly independent data is used to show how a previously built model performed well or failed? If any of my readers have seen this, please drop me a note: you will have made the equivalent of a cryptozoological find.

Incidentally, generating these spurious models is effortless. I didn’t go through 100s of simulations to find one that looked especially misleading. I did just one simulation. Using this stepwise procedure practically guarantees that you will find a “statistically significant” yet spurious model.
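To see how cheaply "significance" turns up in pure noise, here is a rough sketch in Python. It is not the stepwise procedure from the quoted post; it is a simpler stand-in that screens each predictor on its own against y at the 5% level. The sample sizes (100 observations, 50 predictors), the function names, and the approximate t cutoff of 1.984 are all assumptions made for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / math.sqrt(sxx * syy)


def spurious_predictors(n_obs=100, n_pred=50, t_crit=1.984, seed=1):
    """Return (index, r) for noise predictors that look 'significant'.

    y and every x are independent standard normal draws, so any
    relationship found here is spurious by construction.
    t_crit is an approximate two-sided 5% cutoff for 98 degrees of
    freedom (an assumption tied to the default n_obs=100).
    """
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    xs = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_pred)]
    # |t| > t_crit is equivalent to |r| > t_crit / sqrt(t_crit^2 + df)
    r_crit = t_crit / math.sqrt(t_crit ** 2 + (n_obs - 2))
    results = [(j, pearson_r(x, y)) for j, x in enumerate(xs)]
    return [(j, r) for j, r in results if abs(r) > r_crit]


if __name__ == "__main__":
    for j, r in spurious_predictors():
        print(f"x{j}: r = {r:+.3f}  (pure noise, yet 'significant')")
```

With 50 unrelated predictors each tested at the 5% level, about 2.5 are expected to clear the bar by chance, and most runs produce at least one. Rerunning with a different seed plays the role of next year's data: the "winners" from one run rarely win again.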

This sort of thing is why we're barraged with studies showing that almost everything will kill you--no, wait! it'll make you live forever!

Original BS quote here.
