Monday, May 16, 2016

On the development of an expected-ball-in-play production stat, and other developments for the near future

Note:  This is the first of my posts to really have any significant math content.  I've tried to write it in a way that's reasonably approachable for a layman (including links to wikipedia pages for concepts that may be unfamiliar), and have no idea if I've succeeded, so please leave a comment letting me know how I could do better if you have the time.

In my first post on ball-in-play outcomes, I noted that one possible use of the new statcast data would be the development of an "expected wOBABIP" stat (or "xwOBABIP" for short, which is a name amusing enough that the thought of people eventually saying it out loud serves as sufficient justification for this exercise all by itself).  In a few days, I'll be posting some preliminary results in this direction.  In this post, I'll cover what approach I'll be taking and why, possible problems or nuances of this approach, and finally some (not so) bold predictions about what we might see when I finally do crunch the numbers.

(For those who haven't read the original post and/or are unfamiliar with wOBA, an explanation can be found here.  "BIP" stands for "balls in play," so wOBABIP is simply wOBA calculated exclusively on balls put in play.)

The general idea behind the xwOBABIP stat will be precisely as I outlined in that original post - take each of a player's batted ball results, weight it by the wOBABIP of "similar" balls in some bin in trajectory-space around it, and average.  However, while this idea is simple, the implementation admits a rather dizzying array of possible variations in how the specific concerns are handled.  Here are a few such concerns, to give you an idea:

How should we bin the "similar events" around a given ball-in-play event?  Naively, we could evenly tile the trajectory-space with bins (as I've done with hex bins for the plots I've generated) and just take the bin that the event happens to fall in.  However, this has a rather obvious undesirable behavior: it has sharp discontinuities at the boundaries between the bins, meaning that a tiny change in the trajectory of a ball might yield a large change in the expected value, simply because it falls near the boundaries (which, moreover, are entirely arbitrary).  We pretty clearly don't want to introduce arbitrary artifacts into our metric, so this is a problem.

The obvious solution is to instead center the bins at the events themselves, computing a new bin for each datum - in essence performing a moving average centered at each datum.  However, there's yet another problem: even if we bin the data with a moving window, if we simply bin the events in a binary fashion (i.e. either an event is in the bin, or it is out of the bin), we may still see large, arbitrary "jumps" in our average when a datum passes over the boundary of the bin.  So, we may further consider doing a moving weighted average, with a window function that smoothly falls to zero away from its center, so that data gradually lose influence over the expected value at a point as they move further from the point, rather than suddenly dropping off after some fixed distance.  This is a method called "kernel regression," where "kernel" refers to the aforementioned weighting function.  But this leaves us the problem of picking a kernel (our aforementioned binning is equivalent to picking a rectangular function as the kernel).
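The moving weighted average just described is known formally as the Nadaraya-Watson estimator, and it's compact enough to sketch. My actual work is in R, but here's a hypothetical Python illustration with a fixed, per-dimension Gaussian bandwidth (function and variable names are my own invention, not anyone's actual code):

```python
import numpy as np

def gaussian_kernel_regression(train_xy, train_woba, query_xy, bandwidths):
    """Nadaraya-Watson estimate of expected wOBABIP at each query point.

    train_xy: (n, 2) array of (exit velocity, launch angle) for observed balls
    train_woba: (n,) array of the wOBA value credited to each ball in play
    query_xy: (m, 2) trajectory points at which to estimate expected wOBABIP
    bandwidths: (2,) per-dimension Gaussian widths (one for velocity, one for angle)
    """
    # Scaled differences from every query point to every training point
    diff = (query_xy[:, None, :] - train_xy[None, :, :]) / bandwidths  # (m, n, 2)
    d2 = (diff ** 2).sum(axis=2)
    # Gaussian weights: nearby balls count heavily, distant balls fade smoothly to zero
    w = np.exp(-0.5 * d2)
    # Weighted average of outcomes = expected wOBABIP at each query point
    return (w @ train_woba) / w.sum(axis=1)
```

Note that the estimate changes smoothly as a trajectory moves through the space, which is exactly the property the hard-edged bins lacked.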

And regardless of naive binning or fancy kernel-based averaging, we still have the problem of sizing our bins.  A larger bin size ("bandwidth") will yield a higher sample size (which means less sensitivity to noise), at the cost of washing out (meaningful) structure in the data, while a smaller bandwidth will do the opposite - and, moreover, since our data are bivariate (we have exit velocity and exit angle), we have two different bandwidths to pick (one for each variable).  To muddy the waters even more, our data are fairly "clumpy" - there are regions of densely packed data where lots of balls have been hit, and regions of sparse data where not many balls have been hit - so we may want to vary the bandwidth based on the density of data in the region, so that it is smaller in the dense areas (where the sample size is higher and so there is less noise) and larger in the sparse areas.
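One simple adaptive scheme - illustrative only, and not necessarily what any particular package does internally - sets each point's bandwidth to the distance to its k-th nearest neighbor, which automatically shrinks the window in dense regions and widens it in sparse ones:

```python
import numpy as np

def knn_bandwidths(points, k=50):
    """Per-point bandwidth: distance to the k-th nearest neighbor.

    Dense regions (lots of similar batted balls nearby) yield a narrow
    window; sparse regions yield a wide one.
    """
    # Full pairwise distance matrix (fine for a sketch; too slow at scale)
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    d.sort(axis=1)   # row i: distances from point i, ascending; column 0 is the point itself
    return d[:, k]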

As it turns out, this is a very well-studied problem space and there's something of a cottage industry of competing approaches for solving each of these issues.  So, as noted, this makes for a rather vast space of possible approaches we could take with this seemingly simple idea, with no clear single "best approach" to default to.  This means that a lot of the specifics of how I do this are going to be chosen quasi-arbitrarily.

For my part, I'll be sticking to what's easy to do given my means, which basically comes out to "what's available in R libraries that I can wrap my head around in a reasonable amount of time."  Fortunately, there are R packages for just about everything, so I'll be using the np package to do a Gaussian kernel regression with a generalized-nearest-neighbor bandwidth (if you're interested in details, I'll probably post my code later) selected through a cross-validation scheme.  If that sounds like Greek to you: it means that I'll be doing the moving-window weighted average described above where the weighting function is a Gaussian, and the width of the Gaussian at any given point is determined in some algorithmic fashion from the data.
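The np package's cross-validation machinery is considerably more sophisticated than this, but the core logic of least-squares cross-validation can be sketched in a few lines (Python for illustration; a hypothetical 1-D version): score each candidate bandwidth by how well it predicts each point from all the other points, then keep the best scorer.

```python
import numpy as np

def loo_cv_score(x, y, h):
    """Mean squared leave-one-out prediction error of a 1-D Gaussian
    kernel regression with bandwidth h."""
    w = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    np.fill_diagonal(w, 0.0)   # predict each point from everyone but itself
    pred = (w @ y) / w.sum(axis=1)
    return np.mean((pred - y) ** 2)

def select_bandwidth(x, y, candidates):
    """Pick the candidate bandwidth with the lowest held-out error."""
    return min(candidates, key=lambda h: loo_cv_score(x, y, h))
```

A too-small bandwidth chases the noise and a too-large one flattens the real structure; both show up as poor held-out prediction, which is why this scheme lands somewhere sensible in between.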

(If you're wondering why I made these specific choices: the Gaussian kernel was chosen completely arbitrarily, simply because it's the default for the package.  Likewise, the specific cross-validation scheme used was chosen arbitrarily.  I did try several different bandwidth types before settling on generalized-nearest-neighbor, and that one did the best job of preserving fine detail in the dense portions of the data while avoiding meaningless spikes in the sparse bits.  I certainly would like to be able to make expertly-informed choices as to all the parameters I can tweak, but unfortunately that's just not feasible unless I want to sink an inordinate amount of time into reading and understanding journal papers on all the various competing computational methods for this problem.  Of course, this means someone else could approach this same problem and do it a different way and get different results, and be just as justified in their choices as I am.  But the results of doing this with a slightly different approach ultimately had better be very nearly the same, or else this is probably not so worthwhile to begin with.)

Now, by plotting the output of this kernel regression evaluated over a grid covering trajectory-space, I can generate fancy smooth versions of the hex bin plots from my earlier post.  Here's an updated version of the wOBABIP plot, made by doing just that:

At least to my eyes, this is something of an improvement over the earlier hex bin version.  On the downside, computing the bandwidths for this took approximately an hour, and I think it's an O(n^2) problem, so I shudder to think how long it'll take to re-do this in a few months with a full season of data.

In fact, one possible drawback of this approach, in general, is that it's not exactly easy to compute.  Once you have the kernel regression, computing xwOBABIP for any one player is trivial - but computing the kernel regression is highly computation-intensive.  A possible mitigating factor is that once we've computed the kernel regression for a sufficiently large sample, we don't necessarily need to do it again every time we have more data come in, under the assumption that new data won't change the league-average wOBABIP distribution too much.  Even so, this isn't going to be a stat that you can simply calculate on a whim in excel, which gives it a somewhat higher barrier-to-entry than something like BABIP or xBABIP.

Once we have our kernel regression, there's still another subtle issue we must tackle before we have our finished metric: about a quarter of balls put in play do not have statcast trajectory data.  Moreover, for the partial data set I've plotted, wOBABIP on those balls without statcast trajectory data sits at ~.250, as opposed to ~.360 for balls in play as a whole, which tells us without much doubt that those balls in play which are missing data constitute a biased sample.  Thus, we cannot simply omit them without skewing our metric.  I think the best way to handle this, which is still not great, is to use simple imputation: assign every event without trajectory data an xwOBABIP equal to the observed wOBABIP of the missing-data group (the ~.250 figure above).
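Concretely, the player-level calculation with that imputation might look like the following sketch (Python for illustration; the `expected_woba` interface is hypothetical, standing in for the fitted kernel regression):

```python
# League-average wOBABIP of balls in play with no statcast trajectory
# data (~.250 in the partial data set discussed above)
MISSING_TRAJECTORY_WOBA = 0.250

def player_xwobabip(trajectories, expected_woba):
    """Average expected wOBABIP over one player's balls in play.

    trajectories: (exit_velocity, launch_angle) pairs, or None where
        statcast failed to record the ball
    expected_woba: the fitted regression, as a callable mapping a
        trajectory to the expected wOBABIP of similar balls (hypothetical)
    """
    values = [MISSING_TRAJECTORY_WOBA if t is None else expected_woba(t)
              for t in trajectories]
    return sum(values) / len(values)
```

Note that this is the cheap part: once the regression exists, each player's number is just an average of lookups.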

And there you have it, a full how-to guide for developing an xwOBABIP stat.  All that remains is to crunch the numbers on a more recent data set.

So, now, for predictions.  I'm going to try to make a point, moving forward, of making predictions before I crunch the numbers on any of my new statistical tests (of course, if my predictions are correct, you have to take me at my word that I haven't actually crunched the numbers ahead of time - but I suspect this will be moot because I'm bound to be horribly wrong a lot of the time).  This is a good intellectual habit to get into, as it keeps you from falling into the quagmire of testing hypotheses suggested by the data, which is one of the quickest ways to start convincing yourself (and then others) of nonsense.  Unfortunately, this is a piece of epistemic hygiene that is sorely lacking in a lot of places - not just in sabermetrics, but in many ostensibly scientific fields such as nutrition, economics, or "big data."  Looking for trends in data and then inventing stories to explain them is extremely risky business that's almost certain to lead you astray, unless the trends are very large and obvious and the explanations very parsimonious.

So, my bold predictions for what we'll see in the initial run of xwOBABIP?  Mainly, I think we'll see exactly what any reasonable fan would expect to see - hitters who are drastically underperforming relative to their career averages will likely have a wOBABIP that's lower than their xwOBABIP, and hitters who are overperforming will likely have the opposite.  If I were pressed to give some names, Adam Jones and Anthony Rendon (as mentioned in my previous post) are probably in the former group (though given Rendon's struggles last year, perhaps not), and you could probably pencil in Dexter Fowler and Travis Shaw for the latter.

Of course, the truly interesting question is this: if a player who is over/underperforming his career averages does not show the aforementioned gap between wOBABIP and xwOBABIP, how confident should we then be that the change in performance is "real?"  Unfortunately, to answer this we need to have some idea of how fast xwOBABIP stabilizes, and that's a question I don't think we'll be equipped to answer for some time yet.  So, for now we'll have to speculate.  But we should be able to say, I think, that players with large differences between their wOBABIP and xwOBABIP are probably due, on the whole, to regress towards the latter.

(Of course, that's not to say that there aren't reasons that such a difference could be sustainable - possible non-random causes for such a difference include shifting, quality of opponent defense, ballpark effects, and probably some other things I can't think of at the moment.)

I will also, in the near future, post more (and better) visualizations of trajectory-space, including by-result probability graphs, and will potentially take a stab at net team defense metrics via difference plots (though I suspect I still don't have enough data for this).

Gee, that ended up being pretty long-winded!  I hope those of you who made it to the end of this found it worthwhile.  See you in a few days, when the data are in!

1 comment:

  1. Good job! Compared to the raw binned data plots previously shown, the display grid size is much lower, so finer detail can be seen, while the filtering nicely suppresses the noise. There still are some islands of green in the large ocean of blue that are probably spurious and will go away when you plot a larger data set. The most striking improvement is the clear visualization of the effect of the outfield fence. There is a clear "valley" between the "ridge" representing line drives (predominantly green) and the red "peak" representing home runs. The valley is much harder to see in the original plots.
    It would be nice to see filtered versions of the batted ball type, BABIP, and sample size plots.
