A Replacement Level Baseball Blog

Monday, January 28, 2008

Lowered Expectations, Part II

In my last post, I looked at eBABIP (LD% + .117) as a predictor of actual BABIP, and my main findings were:

-There is some year-to-year stability in the size of the residual between eBABIP and BABIP for a given player.

-eBABIP is slightly more effective on good hitters with relatively more playing time, and slightly less effective on below average hitters with fewer PAs. Specifically:

Best prediction quartile (means): .269/.336/.429, 430 plate appearances
Worst prediction quartile (means): .262/.326/.411, 304 plate appearances
League 2005-2007 (min. 100PA): .267/.333/.421, 391 plate appearances
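For anyone following along at home, the eBABIP formula and the residual recapped above can be sketched in a few lines. This is only an illustration: the function names and the sample hitter are made up, and the sign convention (eBABIP minus actual BABIP, so a negative residual means eBABIP came in too low) is inferred from how the residuals are described later in this post.

```python
# Illustrative sketch only: function names and the sample hitter are hypothetical.

def ebabip(ld_rate: float) -> float:
    """Expected BABIP from line drive rate, with LD% given as a fraction (0.20 = 20%)."""
    return ld_rate + 0.117


def residual(ld_rate: float, babip: float) -> float:
    """Signed residual: eBABIP minus actual BABIP.

    Assumed convention: negative means eBABIP underestimated the hitter.
    """
    return ebabip(ld_rate) - babip


# Made-up hitter: 20% line drives, .340 actual BABIP.
signed = residual(0.20, 0.340)
size = abs(signed)  # the absolute residual measures the miss without its direction
print(round(ebabip(0.20), 3), round(signed, 3), round(size, 3))
```

The absolute value is the "how far off" number; the signed value also says in which direction.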

I said that today I'd run some regressions to get a more precise idea of where eBABIP does its best predictive work. So I looked at playing time in 50 plate appearance intervals, and at offensive production by OPS in .20 intervals (I'd use EqA, but I couldn't find numbers for 2005-2006). Here's how it went.



As you can see, there is a big jump in predictive power around the .800 OPS line, with the highest R-square values coming from players with at least a .900 OPS (rare enough territory). In terms of plate appearances, there is a more steady increase in predictability starting around the 300 PA threshold and peaking at 500 PA.

From this, one would figure that the best test group for eBABIP would be .900 OPS guys with at least 500 PAs. Turns out that's true. The R-square for the eBABIP/BABIP regression in those cases is .255, which is the highest value I found using these two parameters.
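The subset test described above (keep only hitters over some PA and OPS cutoffs, then regress BABIP on eBABIP) might look something like the sketch below. It leans on the fact that, for a one-predictor linear regression, R-square is just the squared Pearson correlation. The row layout, the helper names, and every number in `rows` are invented for illustration; none of it is the data behind the .255 figure.

```python
# Illustrative sketch only: row layout, helper names, and all numbers are made up.

def r_squared(xs, ys):
    """R-square of a simple linear regression of ys on xs (squared Pearson r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def subset_r_squared(rows, min_pa, min_ops):
    """Filter player-seasons by PA and OPS, then return (count kept, R-square)."""
    kept = [r for r in rows if r["pa"] >= min_pa and r["ops"] >= min_ops]
    xs = [r["ebabip"] for r in kept]
    ys = [r["babip"] for r in kept]
    return len(kept), r_squared(xs, ys)


# Toy player-seasons, not real data.
rows = [
    {"pa": 620, "ops": 0.950, "ebabip": 0.330, "babip": 0.345},
    {"pa": 580, "ops": 0.910, "ebabip": 0.300, "babip": 0.310},
    {"pa": 540, "ops": 0.905, "ebabip": 0.315, "babip": 0.318},
    {"pa": 250, "ops": 0.700, "ebabip": 0.320, "babip": 0.270},
    {"pa": 510, "ops": 0.760, "ebabip": 0.290, "babip": 0.330},
]

count, r2 = subset_r_squared(rows, min_pa=500, min_ops=0.900)
print(count, round(r2, 3))
```

Looping the two cutoffs over a grid of PA and OPS values would give the kind of threshold-by-threshold comparison discussed above.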

This certainly confirms the quick and dirty hypothesizing we did using the means of the upper and lower quartiles by residual. eBABIP works best on good hitters with a lot of playing time. Still, when you regress the residual on PA and OPS, those two variables account for only about 14% of its variance.

Over/Underestimated Players by BABIP

In my last post I also said that I'd eventually look at the signed value of the residual as opposed to its absolute value. This number tells you not only how far off eBABIP is, but whether it over- or underestimates a player's production. Think of it as giving a vector instead of a magnitude.

So, I ran Pearson correlations between the residual and about 20 different rate and count stats. I found statistically significant relationships with many--the strongest was with batting average, which came in at -.490 (meaning that players with higher batting averages tend to have lower residuals and vice versa). OBP and SLG had similar negative relationships at -.358 and -.260 respectively. There was also a slight negative relationship with SO Rate (-.180). But the correlations that really caught my eye were three "speed" numbers: specifically SB (-.226), 3B (-.224) and GB% (-.355). If you don't see my leap of logic in including ground ball percentage as a speed-related number, read on.

In interpreting these relationships, remember that I'm using the signed value of the residual. So, for instance, the residual's negative relationship with stolen bases, triples, and ground ball percentage tells you that guys with higher numbers in those categories tend to be underestimated by eBABIP--that is, they tend to have negative residuals. Conversely, guys with lower SB, 3B, and GB% numbers tend to be overestimated by eBABIP--they tend to have positive residuals. What this tells me is that eBABIP underestimates fleet-footed, punch-and-judy types whose ability to beat out ground balls and take extra bases accounts for a large part of their offensive value. Likewise, eBABIP overrates line-drive hitters who are slow on their feet and don't get their share of ground ball hits. I could get a much better idea of the correlation here by computing Bill James' speed score and pitting it against the residuals, but they don't pay me enough (i.e., anything at all) to do it right now.
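To make the sign logic concrete, here is a bare-bones Pearson correlation between a signed residual and a speed stat. The `pearson` helper and both lists are made up for illustration (five imaginary hitters, sorted from fastest to slowest); the only point is that when the fast guys carry the negative residuals, the coefficient comes out negative, just like the SB, 3B, and GB% numbers above.

```python
import math

# Illustrative sketch only: the helper name and both lists are hypothetical.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Five imaginary hitters, fastest first (toy numbers, not real players).
stolen_bases = [50, 30, 10, 2, 0]
# Signed residuals (eBABIP minus BABIP): negative means eBABIP came in too low.
residuals = [-0.060, -0.030, 0.000, 0.020, 0.040]

r = pearson(stolen_bases, residuals)
print(round(r, 3))
```

A negative `r` here reads the same way as in the post: more steals, more negative residual, more underestimated by eBABIP.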

So instead, I'll look at a few extreme case studies, namely guys whose BABIP missed their eBABIP by 50 points or more in either direction, and see informally if a pattern emerges. First, let's look at guys who were grossly underestimated by eBABIP--guys who had residuals of -.050 or less. Since my data is arranged by player-team-year and covers 2005-2007, a player can appear up to three times, and wouldn't you know it, three guys do:

Willy Taveras, Joey Gathright, and Luis Castillo all hit at least 50 points better than their line drive rates would suggest in each year from 2005-2007. As predicted, all three are slap-happy singles hitters with above average speed. In fact, they are more or less each other's closest active comparables. The guys who make the list twice mostly fit the bill as well:

Derek Jeter
Howie Kendrick
Kelly Shoppach
Pablo Ozuna
Preston Wilson
Ryan Shealy
Tadahito Iguchi
Victor Diaz
Willy Aybar

Just don't ask me to explain Kelly Shoppach beyond small sample size: as a backup catcher he never got more than 200 PAs.

At the other extreme, the list of guys who came up short of their expected BABIP by 50 points (.050) or more is topped by lead-footed St. Louis backup catcher Gary Bennett, who appears three times. Bennett has a slightly above average LD%, but does absolutely nothing with it. In 2007, he managed just 54 total bases on 44 hits, good for a .221/.298/.271 line. Still, there is no shortage of good hitters who make the list, including Eric Byrnes, Barry Bonds, and Frank Thomas, all of whom appear twice. Like these three, many on the list are slow sluggers or, like Bennett, catchers. The complete stats for all members of the "50 Point Club" can be found here.
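Tallying club membership is simple bookkeeping: flag every player-season whose signed residual clears the 50-point cutoff in either direction, then count appearances per player. The sketch below assumes rows of (player, season, residual) with the residual defined as eBABIP minus BABIP; the function name, the row layout, and the "Player A/B/C" numbers are all invented for illustration.

```python
from collections import Counter

# Illustrative sketch only: function name, row layout, and all numbers are made up.

THRESHOLD = 0.050  # the 50-point cutoff used in the post


def fifty_point_club(rows, threshold=THRESHOLD):
    """Count player-seasons with residuals at least `threshold` away from zero.

    Returns (underestimated, overestimated) Counters keyed by player name.
    Assumes residual = eBABIP - BABIP, so negative means underestimated.
    """
    underestimated = Counter()
    overestimated = Counter()
    for player, _season, resid in rows:
        if resid <= -threshold:
            underestimated[player] += 1
        elif resid >= threshold:
            overestimated[player] += 1
    return underestimated, overestimated


# Toy player-seasons, not real data.
rows = [
    ("Player A", 2005, -0.061), ("Player A", 2006, -0.055), ("Player A", 2007, -0.072),
    ("Player B", 2005, -0.020), ("Player B", 2006, 0.010), ("Player B", 2007, -0.051),
    ("Player C", 2006, 0.064), ("Player C", 2007, 0.058),
]

under, over = fifty_point_club(rows)
print(under.most_common())
print(over.most_common())
```

Sorting by count is what surfaces the three-timers at the top of each list.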

The next step would be to get a better handle on the shape of the relationship between these speed factors and BABIP, and to come up with a more nuanced version of eBABIP that reflects them. Fat chance getting that out of me today.
