A Replacement Level Baseball Blog

Sunday, January 27, 2008

Lowered Expectations

How to beat the odds and hit above your line drive rate.

One of the more interesting insights to come out of Clay Davenport's marriage of baseball analysis with computer science is this: take one hundred major leaguers who are "true" .270 hitters and give them all 200 simulated at-bats, and by luck alone at least one of them will hit .350 and another .190.
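For anyone who wants to poke at that claim, here is a minimal sketch of the experiment. It is not Davenport's simulation. It assumes every at-bat is an independent coin flip with a .270 chance of a hit, and the function name, seed and defaults are my own placeholders. How extreme the best and worst hitters come out will vary from run to run.

```python
import random


def simulate_averages(true_avg=0.270, at_bats=200, hitters=100, seed=2008):
    """Return one simulated batting average per hitter.

    Assumption: each at-bat is an independent trial that succeeds
    with probability true_avg. Real at-bats are messier than that.
    """
    rng = random.Random(seed)
    averages = []
    for _ in range(hitters):
        hits = sum(1 for _ in range(at_bats) if rng.random() < true_avg)
        averages.append(hits / at_bats)
    return averages


averages = simulate_averages()
print(f"best: {max(averages):.3f}  worst: {min(averages):.3f}")
```

Change the seed or bump at_bats up to 600 to see how quickly the spread between the luckiest and unluckiest hitter narrows.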

There is a lot of random variance in baseball, enough for all-stars to look like Texas Leaguers and triple-A rejects to look like home run kings given a short enough timeline. And enough that finding ways to separate luck from talent in a given season can be tremendously profitable for general managers.

One of the more popular ways of doing this is by comparing actual to expected batting average on balls in play (BABIP to eBABIP). The former, of course, is calculated by removing home runs and strikeouts as outcomes of at-bats, so you take (H-HR)/(AB-HR-SO). The latter is estimated by taking a hitter's line drive rate (expressed as a decimal like .182) and adding to it a constant--usually .120--to get a number like .302.
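The two formulas are simple enough to write down directly. This is just a sketch of the definitions as given in this post; the function names and the sample season line are made up for illustration.

```python
def babip(hits, home_runs, at_bats, strikeouts):
    """Batting average on balls in play: (H-HR)/(AB-HR-SO)."""
    balls_in_play = at_bats - home_runs - strikeouts
    if balls_in_play <= 0:
        raise ValueError("no balls in play")
    return (hits - home_runs) / balls_in_play


def expected_babip(line_drive_rate, constant=0.120):
    """eBABIP: line drive rate (as a decimal) plus a constant."""
    return line_drive_rate + constant


# Hypothetical season line, not a real player.
actual = babip(hits=150, home_runs=20, at_bats=550, strikeouts=100)
expected = expected_babip(0.182)
print(f"BABIP {actual:.3f}  eBABIP {expected:.3f}")
```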

So far as I can tell, the constant is what it is because it's equal to league average BABIP minus league average LD% over some significant stretch of time (over the last three years this number is .117). If there is a more complex way of deriving it, I'm not aware of it. The idea eBABIP is intended to capture is that most of a player's hits are going to come off line drives, and that the number of hits a player gets off fly balls and grounders is fairly stable (since defensive efficiency for those outcomes is fairly stable).
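If my guess about where the constant comes from is right, deriving it is a one-liner. The sketch below assumes exactly that guess (league BABIP minus league LD%, averaged over a few seasons). The league figures in it are placeholders I invented to show the arithmetic, not real league totals.

```python
def league_constant(seasons):
    """Average gap between league BABIP and league line drive rate.

    seasons: list of dicts with 'babip' and 'ld_rate' as decimals.
    """
    gaps = [s["babip"] - s["ld_rate"] for s in seasons]
    return sum(gaps) / len(gaps)


# Placeholder league lines, for illustration only.
made_up_seasons = [
    {"babip": 0.300, "ld_rate": 0.180},
    {"babip": 0.304, "ld_rate": 0.190},
]
print(f"constant: {league_constant(made_up_seasons):.3f}")
```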

But eBABIP can also be used as an excuse for using generalities to draw fallacious conclusions about individual players, since there will always be some who consistently beat their BABIP expectations by substantial margins.

Well, so I claimed anyway. I figure it's only fair to run some numbers and see whether I'm right or stupid. So what I did is this: I took all major leaguers with at least 100 plate appearances in each year from 2005 to 2007 and I calculated BABIP and eBABIP for all of them. I also collected about 100 other rate and count stats for these guys. Then I put the whole mess of it in a big-ass spreadsheet and ran a bunch of statistical tests.

First off, I wondered if there was any year-to-year stability in the size of the residual (the difference between a hitter's expected and actual BABIP, expressed as a decimal). If there was, it would indicate that how well or poorly eBABIP predicts BABIP for a given player is not random, but a function of some disposition or capacity that player has. I say "disposition" or "capacity" instead of "skill" because I don't care whether the thing that determines a player's residual is useful, only whether it's tangible.

OK, so I ran three-year intraclass correlations on BABIP residuals and got an "R" number of .276. The R-number gives you something like the ratio of signal-to-noise in the sample. In this case it's just about 3-to-1 noise. Not great, but not terrible considering that the R-number for BABIP itself over the same stretch is just .369 (as a point of comparison, the number for home runs is .675). Moreover, the residual is actually more stable than eBABIP itself, which came in at .267. Since I'm not a statistician, I'm not sure what eBABIP's instability really means besides the fact that LD% is itself unstable (and that's a freebie given that eBABIP just is LD% plus a constant).
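For reference, here is roughly what an intraclass correlation does under the hood. This is a sketch of the one-way random-effects version, ICC(1), which may or may not be the variant my stats package used. It assumes every player has the same number of seasons and that the data aren't all identical. Reading the result as the share of variance that is "signal", an R of .276 works out to (1 - .276)/.276, or about 2.6 parts noise to one part signal, which is the neighborhood of the 3-to-1 I quoted. The sample residuals are invented.

```python
def icc_oneway(rows):
    """One-way random-effects ICC(1).

    rows: one list per player, each holding k season values
    (for example, three yearly BABIP residuals).
    """
    n = len(rows)
    k = len(rows[0])
    grand_mean = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    # Variation between players versus variation within a player.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for r, m in zip(rows, row_means) for x in r)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def noise_to_signal(icc):
    """Parts noise per one part signal, treating ICC as the signal share."""
    return (1 - icc) / icc


# Invented three-year residuals for four hypothetical players.
residuals = [
    [0.021, 0.005, 0.030],
    [-0.015, -0.002, -0.020],
    [0.004, -0.010, 0.012],
    [-0.030, 0.008, -0.011],
]
print(f"ICC: {icc_oneway(residuals):.3f}")
print(f"noise per unit signal at R=.276: {noise_to_signal(0.276):.2f}")
```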

Anyway, the upshot is there is at least something systematic in how well or poorly eBABIP predicts BABIP for a given player. And if that's so, it's then reasonable to infer that there are certain "types" of players for whom the eBABIP method is particularly accurate, and others for whom it is not.

So I took the absolute value of the residuals--the distance from the value zero, a perfect prediction--and ranked these by percentile. I then ran the means, std. deviations, skews, etc. on the lower quartile (the most accurate eBABIP predictions) and the upper quartile (the least accurate eBABIP predictions). The descriptive statistics are here and here. Two things stood out.
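The quartile split itself looks something like the sketch below. I did mine in a spreadsheet, so treat this as an equivalent in spirit rather than the exact procedure: the dict keys and the sample players are hypothetical, and statistics.quantiles uses its default cut-point method, which may differ slightly from a spreadsheet's percentile function.

```python
import statistics


def split_by_accuracy(players):
    """Return (good, bad): the lower and upper quartiles by |BABIP - eBABIP|.

    players: list of dicts with 'babip' and 'ebabip' keys.
    """
    scored = [(abs(p["babip"] - p["ebabip"]), p) for p in players]
    q1, _, q3 = statistics.quantiles([a for a, _ in scored], n=4)
    good = [p for a, p in scored if a <= q1]  # most accurate predictions
    bad = [p for a, p in scored if a >= q3]   # least accurate predictions
    return good, bad


# Hypothetical players whose misses grow from about .010 to about .080.
sample = [
    {"name": i, "babip": 0.300 + 0.01 * i * (-1) ** i, "ebabip": 0.300}
    for i in range(1, 9)
]
good, bad = split_by_accuracy(sample)
print("good predictors:", [p["name"] for p in good])
print("bad predictors:", [p["name"] for p in bad])
```

From there, the group comparisons are just means of whatever stat you care about over each list.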

1) The "good predictor" group are better hitters than the "bad predictor" group. The AVG/OBP/SLG for the former is .269/.336/.429, compared to .262/.326/.411 for the latter. The good predictors also walked more (BB/PA) and struck out less (SO/PA).

2) The good predictor group played more than the bad predictor group, averaging more games (111 vs. 84), plate appearances (430 vs. 304) and at-bats (383 vs. 271).

Surely (1) and (2) are related. Good hitters get more playing time than bad ones, and the smart money says that, for what we're interested in, the causal chain goes from (2) --> (1). In other words, it isn't that eBABIP fails for poor hitters, but that it fails for smaller sample sizes. (Though, when you take into account that the good predictors are almost a year older on average, it becomes marginally trickier to guess which variable is doing most of the work, since older players tend to get more PAs than younger ones, regardless of talent.) I should also say that the good predictors averaged higher VORP (15) than the bad predictors (8.92), which nicely parallels (1) and (2) since VORP takes both production and playing time into account.
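The sample size story is easy to sanity-check with a toy simulation. The sketch below assumes a hitter whose true BABIP is exactly .300 and treats every ball in play as an independent trial, then measures how far the observed BABIP lands from the truth on average. The ball-in-play counts are hypothetical round numbers, not the group averages above.

```python
import random


def mean_abs_miss(balls_in_play, true_rate=0.300, trials=2000, seed=27):
    """Average |observed BABIP - true BABIP| over many simulated seasons.

    Assumption: each ball in play independently falls for a hit
    with probability true_rate.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        hits = sum(1 for _ in range(balls_in_play) if rng.random() < true_rate)
        total += abs(hits / balls_in_play - true_rate)
    return total / trials


# Hypothetical workloads, from part-timer to regular.
for n in (100, 200, 400):
    print(f"{n} balls in play: average miss {mean_abs_miss(n):.4f}")
```

If luck alone produces bigger misses for part-timers in this toy version, then some of the gap between my good and bad predictor groups is what you'd expect from playing time before talent enters the picture.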

So eBABIP can't get a handle on young, part-time and/or poor hitters, kinda sorta. None of this is really revelatory, nor even that helpful, in part because there is more work to do. Tomorrow, or when I get around to it, I'll run regressions between eBABIP and BABIP at different OPS and PA thresholds just to get a better idea of where on the output/playing time spectrums eBABIP does its best and worst work. I'll also run some numbers on players whose BABIPs were substantially over- and under-estimated by eBABIP. Basically, instead of worrying about the absolute accuracy of eBABIP (as measured by the absolute value of the residual) I'll be looking at whether there is any pattern to the way eBABIP shortchanges some players' output and overblows others'.

But that's tomorrow.
