Monday, May 16, 2016

On the development of an expected-ball-in-play production stat, and other developments for the near future

Note:  This is the first of my posts to really have any significant math content.  I've tried to write it in a way that's reasonably approachable for a layman (including links to wikipedia pages for concepts that may be unfamiliar), and have no idea if I've succeeded, so please leave a comment letting me know how I could do better if you have the time.

In my first post on ball-in-play outcomes, I noted that one possible use of the new Statcast data would be the development of an "expected wOBABIP" stat (or "xwOBABIP" for short, which is a name amusing enough that the thought of people eventually saying it out loud serves as sufficient justification for this exercise all by itself).  In a few days, I'll be posting some preliminary results in this direction.  In this post, I'll cover what approach I'll be taking and why, possible problems or nuances of this approach, and finally some (not so) bold predictions about what we might see when I finally do crunch the numbers.

(For those who haven't read the original post and/or are unfamiliar with wOBA, an explanation can be found here.  "BIP" stands for "balls in play," so wOBABIP is simply wOBA calculated exclusively on balls put in play)

The general idea behind the xwOBABIP stat will be precisely as I outlined in that original post - take each of a player's batted ball results, weight it by the wOBABIP of "similar" balls in some bin in trajectory-space around it, and average.  However, while this idea is simple, the implementation admits a rather dizzying array of possible variations in how the specific concerns are handled.  Here are a few such concerns, to give you an idea:

How should we bin the "similar events" around a given ball-in-play event?  Naively, we could evenly tile the trajectory-space with bins (as I've done with hex bins for the plots I've generated) and just take the bin that the event happens to fall in.  However, this has a rather obvious undesirable behavior: it has sharp discontinuities at the boundaries between the bins, meaning that a tiny change in the trajectory of a ball might yield a large change in the expected value, simply because it falls near the boundaries (which, moreover, are entirely arbitrary).  We pretty clearly don't want to introduce arbitrary artifacts into our metric, so this is a problem.
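To make the boundary problem concrete, here is a minimal Python sketch (not the R code behind the plots) using a one-dimensional slice of trajectory-space.  The events, their values, and the 10-degree bin width are all made up for illustration.

```python
import math

# Hypothetical (launch_angle_deg, woba_value) pairs; the numbers are invented.
events = [(12.0, 0.9), (15.0, 0.9), (18.0, 0.0), (19.0, 0.9),
          (21.0, 0.0), (24.0, 0.0), (27.0, 0.0), (29.0, 1.25)]

BIN_WIDTH = 10.0  # degrees; an arbitrary choice, which is the whole problem


def bin_index(angle):
    """Index of the fixed bin that an angle falls into."""
    return math.floor(angle / BIN_WIDTH)


def fixed_bin_estimate(angle, data):
    """Mean value of all events sharing a fixed bin with `angle`."""
    values = [v for a, v in data if bin_index(a) == bin_index(angle)]
    return sum(values) / len(values) if values else float("nan")


# Two trajectories 0.2 degrees apart land on opposite sides of the 20-degree edge.
just_below = fixed_bin_estimate(19.9, events)
just_above = fixed_bin_estimate(20.1, events)
print(just_below, just_above)
```

With these toy numbers the two nearly identical trajectories get estimates of about 0.675 and 0.3125, purely because of where the bin edge happens to sit.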

The obvious solution is to instead center the bins at the events themselves, computing a new bin for each datum - in essence performing a moving average centered at each datum.  However, there's yet another problem: even if we bin the data with a moving window, if we simply bin the events in a binary fashion (i.e. either an event is in the bin, or it is out of the bin), we may still see large, arbitrary "jumps" in our average when a datum passes over the boundary of the bin.  So, we may further consider doing a moving weighted average, with a window function that smoothly falls to zero away from its center, so that data gradually lose influence over the expected value at a point as they move further from the point, rather than suddenly dropping off after some fixed distance.  This is a method called "kernel regression," where "kernel" refers to the aforementioned weighting function.  But this leaves us the problem of picking a kernel (our aforementioned binning is equivalent to picking a rectangular function as the kernel).
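A minimal sketch of this kind of moving weighted average (often called a Nadaraya-Watson estimator) is below.  It is written in plain Python rather than R, uses a product of two one-dimensional kernels as a simplifying assumption, and the function names, toy events, and bandwidths are all hypothetical.

```python
import math


def gaussian_kernel(u):
    """Unnormalized Gaussian weight; decays smoothly toward zero."""
    return math.exp(-0.5 * u * u)


def rectangular_kernel(u):
    """Binary in-or-out weight; equivalent to a moving bin."""
    return 1.0 if abs(u) <= 1.0 else 0.0


def kernel_regression(point, data, bandwidths, kernel=gaussian_kernel):
    """Kernel-weighted average of values near `point`.

    `point` is (exit_velocity, launch_angle), `data` is a list of
    ((exit_velocity, launch_angle), value) pairs, and `bandwidths`
    is (h_velocity, h_angle).
    """
    h_v, h_a = bandwidths
    num = den = 0.0
    for (v, a), value in data:
        w = kernel((point[0] - v) / h_v) * kernel((point[1] - a) / h_a)
        num += w * value
        den += w
    return num / den if den > 0 else float("nan")


# Two made-up events: a hard liner worth 1.0 and a weak grounder worth 0.0.
toy = [((100.0, 20.0), 1.0), ((90.0, 10.0), 0.0)]
h = (5.0, 5.0)

midpoint = kernel_regression((95.0, 15.0), toy, h)
near_liner = kernel_regression((99.0, 19.0), toy, h)

# The rectangular kernel jumps as the liner drops out of the window.
rect_inside = kernel_regression((95.0, 15.0), toy, h, rectangular_kernel)
rect_outside = kernel_regression((94.0, 15.0), toy, h, rectangular_kernel)
print(midpoint, near_liner, rect_inside, rect_outside)
```

In this toy case the Gaussian estimate slides smoothly from 0.5 toward 1.0 as the query point approaches the liner, while the rectangular estimate drops from 0.5 straight to 0.0 over a 1 mph change.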

And regardless of naive binning or fancy kernel-based averaging, we still have the problem of sizing our bins.  A larger bin size ("bandwidth") will yield a higher sample size (which means less sensitivity to noise), at the cost of washing out (meaningful) structure in the data, while a smaller bandwidth will do the opposite - and, moreover, since our data is bivariate (we have exit velocity and exit angle), we have two different bandwidths to pick (one for each variable).  To muddy the waters even more, our data are fairly "clumpy" - there are regions of densely packed data where lots of balls have been hit, and regions of sparse data where not many balls have been hit - so we may want to vary the bandwidth based on the density of data in the region, so that it is smaller in the dense areas (where the sample size is higher and so there is less noise) and larger in the sparse areas.
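One simple way to let the bandwidth track the local density of the data is to set it to the distance to the k-th nearest event.  The sketch below shows that idea in one dimension; the helper name, the sample angles, and the choice of k are assumptions for illustration, and this is not claimed to match what any particular R package does internally.

```python
def knn_bandwidth(x, xs, k):
    """Distance from x to its k-th nearest sample point (k >= 1)."""
    distances = sorted(abs(x - xi) for xi in xs)
    return distances[k - 1]


# Made-up launch angles: a dense clump near 12 degrees, then two stragglers.
angles = [10.0, 11.0, 12.0, 13.0, 14.0, 40.0, 60.0]

dense_h = knn_bandwidth(12.0, angles, k=3)   # inside the clump
sparse_h = knn_bandwidth(50.0, angles, k=3)  # out among the stragglers
print(dense_h, sparse_h)
```

Here the bandwidth is 1 degree inside the clump and 36 degrees out in the sparse region, so the window always reaches the same number of events.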

As it turns out, this is a very well-studied problem space and there's something of a cottage industry of competing approaches for solving each of these issues.  So, as noted, this makes for a rather vast space of possible approaches we could take with this seemingly simple idea, with no clear single "best approach" to default to.  This means that a lot of the specifics of how I do this are going to be chosen quasi-arbitrarily.

For my part, I'll be sticking to what's easy to do given my means, which basically comes out to "what's available in R libraries that I can wrap my head around in a reasonable amount of time."  Fortunately, there are R packages for just about everything, so I'll be using the np package to do a Gaussian kernel regression with a generalized-nearest-neighbor bandwidth (if you're interested in details, I'll probably post my code later) selected through a cross-validation scheme.  If that sounds like Greek to you: it means that I'll be doing the moving-window weighted average described above where the weighting function is a Gaussian, and the width of the Gaussian at any given point is determined in some algorithmic fashion from the data.
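For a rough feel of what "selected through a cross-validation scheme" can mean, here is a minimal leave-one-out sketch in Python.  It assumes a single fixed bandwidth in one dimension and picks from a short list of candidates, which is far simpler than the nearest-neighbor bandwidths and optimizer in the np package; the data are synthetic and every name here is hypothetical.

```python
import math
import random


def nw_estimate(x, xs, ys, h, skip=None):
    """Gaussian-kernel weighted average at x, optionally leaving one point out."""
    num = den = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        if i == skip:
            continue
        w = math.exp(-0.5 * ((x - xi) / h) ** 2)
        num += w * yi
        den += w
    return num / den if den > 0 else float("nan")


def loo_cv_score(xs, ys, h):
    """Mean squared error when each point is predicted from all the others."""
    errors = [(ys[i] - nw_estimate(xs[i], xs, ys, h, skip=i)) ** 2
              for i in range(len(xs))]
    return sum(errors) / len(errors)


# Synthetic stand-in data: a bump in value near a 20-degree angle, plus noise.
rng = random.Random(0)
angles = [rng.uniform(-30.0, 60.0) for _ in range(200)]
values = [0.3 + 0.5 * math.exp(-((a - 20.0) / 8.0) ** 2) + rng.gauss(0.0, 0.3)
          for a in angles]

candidates = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
scores = {h: loo_cv_score(angles, values, h) for h in candidates}
best_h = min(scores, key=scores.get)
print(best_h, scores[best_h])
```

Each candidate costs one pass over all pairs of points, which is where the quadratic growth in run time comes from.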

(If you're wondering why I made these specific choices: the Gaussian kernel was chosen completely arbitrarily, simply because it's the default for the package.  Likewise, the specific cross-validation scheme used was chosen arbitrarily.  I did try several different bandwidth types before settling on generalized-nearest-neighbor, and that one did the best job of preserving fine detail in the dense portions of the data while avoiding meaningless spikes in the sparse bits.  I certainly would like to be able to make expertly-informed choices as to all the parameters I can tweak, but unfortunately that's just not feasible unless I want to sink an inordinate amount of time into reading and understanding journal papers on all the various competing computational methods for this problem.  Of course, this means someone else could approach this same problem and do it a different way and get different results, and be just as justified in their choices as I am.  But the results of doing this with a slightly different approach ultimately had better be very nearly the same, or else this is probably not so worthwhile to begin with.)

Now, by plotting the output of this kernel regression evaluated over a grid covering trajectory-space, I can generate fancy smooth versions of the hex bin plots from my earlier post.  Here's an updated version of the wOBABIP plot, made by doing just that:

At least to my eyes, this is something of an improvement over the earlier hex bin version.  On the down-side, computing the bandwidths for this took approximately an hour, and I think it's an O(n^2) problem so I shudder to think how long it'll take to re-do this in a few months with a full season of data.

In fact, one possible drawback of this approach, in general, is that it's not exactly easy to compute.  Once you have the kernel regression, computing xwOBABIP for any one player is trivial - but computing the kernel regression is highly computation-intensive.  A possible mitigating factor is that once we've computed the kernel regression for a sufficiently large sample, we don't necessarily need to do it again every time we have more data come in, under the assumption that new data won't change the league-average wOBABIP distribution too much.  Even so, this isn't going to be a stat that you can simply calculate on a whim in Excel, which gives it a somewhat higher barrier-to-entry than something like BABIP or xBABIP.
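One way to picture the "compute once, reuse many times" idea is to evaluate the expensive estimator on a grid and then look up the nearest grid node for each new ball in play.  This Python sketch assumes a nearest-node lookup and a 5 mph by 5 degree grid; the `slow_estimate` function is a made-up stand-in for the real regression, not anything taken from the np package.

```python
calls = {"n": 0}


def slow_estimate(velocity, angle):
    """Made-up stand-in for an expensive kernel regression evaluation."""
    calls["n"] += 1
    return 0.001 * velocity + 0.002 * abs(angle)


def build_grid(estimate, velocities, angles):
    """Evaluate the estimator once per grid node."""
    return {(v, a): estimate(v, a) for v in velocities for a in angles}


def lookup(grid, velocities, angles, velocity, angle):
    """Nearest-node lookup; no further calls to the estimator."""
    nearest_v = min(velocities, key=lambda g: abs(g - velocity))
    nearest_a = min(angles, key=lambda g: abs(g - angle))
    return grid[(nearest_v, nearest_a)]


velocities = list(range(40, 121, 5))  # 40, 45, ..., 120 mph
angles = list(range(-60, 91, 5))      # -60, -55, ..., 90 degrees

grid = build_grid(slow_estimate, velocities, angles)
calls_after_build = calls["n"]

value = lookup(grid, velocities, angles, 97.3, 11.8)  # snaps to (95, 10)
print(calls_after_build, value, calls["n"])
```

The trade-off is resolution: a coarser grid is cheaper to build but reintroduces a little of the blockiness that the kernel was meant to remove.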

Once we have our kernel regression, there's still another subtle issue we must tackle before we have our finished metric: about a quarter of balls put in play do not have Statcast trajectory data.  Moreover, for the partial data set I've plotted, wOBABIP on those balls without Statcast trajectory data sits at ~.250, as opposed to ~.360 for balls in play as a whole, which tells us without much doubt that those balls in play which are missing data constitute a biased sample.  Thus, we cannot simply omit them without skewing our metric.  I think the best way to handle this, which is still not great, is to use simple imputation and give events without trajectory data an xwOBABIP value of simply the wOBABIP of events without trajectory data.
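A minimal Python sketch of that imputation step follows.  The `toy_surface` function is an invented stand-in for the league-wide regression, the four events are made up, and the ~.250, ~.360, and "about a quarter" figures are the rough ones quoted above.

```python
MISSING_WOBABIP = 0.250  # rough league value on balls without trajectory data


def toy_surface(exit_velocity, launch_angle):
    """Made-up stand-in for a league wOBABIP surface."""
    if exit_velocity >= 98.0 and 20.0 <= launch_angle <= 35.0:
        return 1.20
    if 10.0 <= launch_angle < 20.0:
        return 0.70
    return 0.20


def player_xwobabip(events, expected_value, missing_value):
    """Average expected value over a player's balls in play.

    `events` holds (exit_velocity, launch_angle) tuples, or None where
    the trajectory was not recorded; those get the imputed value.
    """
    total = 0.0
    for event in events:
        total += missing_value if event is None else expected_value(*event)
    return total / len(events)


events = [(100.0, 25.0), (85.0, -5.0), None, (95.0, 15.0)]

with_imputation = player_xwobabip(events, toy_surface, MISSING_WOBABIP)
tracked_only = player_xwobabip([e for e in events if e is not None],
                               toy_surface, MISSING_WOBABIP)

# Implied wOBABIP on tracked balls if a quarter of all balls sit at .250
# and the overall figure is .360 (both approximate).
implied_tracked = (0.360 - 0.25 * MISSING_WOBABIP) / 0.75
print(with_imputation, tracked_only, implied_tracked)
```

In this toy case, dropping the untracked ball inflates the result from roughly .588 to .700, which is the skew that imputation is meant to avoid.  The last line also shows that the tracked balls must be averaging roughly .397 for the quoted league figures to be consistent.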

And there you have it, a full how-to guide for developing an xwOBABIP stat.  All that remains is to crunch the numbers on a more recent data set.

So, now, for predictions.  I'm going to try to make a point, moving forward, of making predictions before I crunch the numbers on any of my new statistical tests (of course, if my predictions are correct, you have to take me at my word that I haven't actually crunched the numbers ahead of time - but I suspect this will be moot because I'm bound to be horribly wrong a lot of the time).  This is a good intellectual habit to get into, as it keeps you from falling into the quagmire of testing hypotheses suggested by the data, which is one of the quickest ways to start convincing yourself (and then others) of nonsense.  Unfortunately, this is a piece of epistemic hygiene that is sorely lacking in a lot of places - not just in sabermetrics, but in many ostensibly scientific fields such as nutrition, economics, or "big data."  Looking for trends in data and then inventing stories to explain them is extremely risky business that's almost certain to lead you astray, unless the trends are very large and obvious and the explanations very parsimonious.

So, my bold predictions for what we'll see in the initial run of xwOBABIP?  Mainly, I think we'll see exactly what any reasonable fan would expect to see - hitters who are drastically underperforming relative to their career averages will likely have a wOBABIP that's lower than their xwOBABIP, and hitters who are overperforming will likely have the opposite.  If I were pressed to give some names, Adam Jones and Anthony Rendon (as mentioned in my previous post) are probably in the former group (though given Rendon's struggles last year, perhaps not), and you could probably pencil in Dexter Fowler and Travis Shaw for the latter.

Of course, the truly interesting question is this: if a player is over- or underperforming his career averages but does not show the aforementioned gap between his wOBABIP and xwOBABIP, how confident should we be that the change in performance is "real"?  Unfortunately, to answer this we need to have some idea of how fast xwOBABIP stabilizes, and that's a question I don't think we'll be equipped to answer for some time yet.  So, for now we'll have to speculate.  But we should be able to say, I think, that players with large differences between their wOBABIP and xwOBABIP are probably due, on the whole, to regress towards the latter.

(Of course, that's not to say that there aren't reasons that such a difference could be sustainable - possible non-random causes for such a difference include shifting, quality of opponent defense, ballpark effects, and probably some other things I can't think of at the moment.)

I will also, in the near future, post some more, better visualizations of trajectory-space, including by-result probability graphs, as well as potentially taking a stab at net team defense metrics via difference plots (though I suspect I still don't have enough data for this).

Gee, that ended up being pretty long-winded!  I hope those of you who made it to the end of this found it worthwhile.  See you in a few days, when the data are in!

Friday, May 13, 2016

Lies, Damn Lies, and...: Anthony Castrovince Edition

This'll be the first of a series of blog posts where I demonstrate a particularly egregious example of bad statistics and then hold it up for mockery.  Think of it as a statistical "What's Wrong with This Picture?" series.

The (dis)honor of the inaugural post in this series goes to Anthony Castrovince (of MLB.com and Sports on Earth), whose most recent column details "13 individuals or entities who, for one reason or another, can claim bad luck without it sounding like a loser's lament."  The column itself wasn't so bad, save for a few annoyances like describing FIP as "a measure of the things a pitcher has under his control" (as if a pitcher magically has control over a batted ball outcome if it happens to go over the fence, but not if it lands two feet in front of it), or speaking of balls whose trajectories had the same "exact specifications" without any mention of error size, or (as per usual) quoting Statcast numbers to three significant figures.

Towards the end of the article, however, Castrovince presents us with a real whopper:

This...is not how averages work.

(For those readers who don't immediately see why this is abject nonsense: imagine that Adam Jones hit all of his balls at a 94mph exit velocity, but hit half of them at a 52-degree launch angle and the other half at a -28-degree launch angle - a distribution that is, indeed, consistent with his "average trajectory."  Using the tool Castrovince links, we should expect Adam Jones to have a BABIP of under .100!  The key point here is that the mean does not uniquely specify, or even provide much useful information about, the actual distribution of batted balls and so we can't draw any conclusions at all about what results Rendon or Jones should be getting on their balls in play from it alone.)
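The same point in a few lines of Python: two very different sets of launch angles share a mean of 12 degrees, yet produce wildly different results.  The `hit_probability` function is invented purely for illustration and is not the tool Castrovince links.

```python
from statistics import mean

# Two hypothetical hitters, 100 balls in play each, all at the same exit velocity.
pop_ups_and_choppers = [52.0, -28.0] * 50  # half at 52 degrees, half at -28
all_line_drives = [12.0] * 100             # every ball at 12 degrees


def hit_probability(launch_angle):
    """Made-up outcome model: line drives fall in, everything else rarely does."""
    return 0.70 if 8.0 <= launch_angle <= 20.0 else 0.05


mean_a = mean(pop_ups_and_choppers)
mean_b = mean(all_line_drives)

expected_babip_a = mean(hit_probability(a) for a in pop_ups_and_choppers)
expected_babip_b = mean(hit_probability(a) for a in all_line_drives)
print(mean_a, mean_b, expected_babip_a, expected_babip_b)
```

Both hitters have the same "average trajectory," but under this toy model one has an expected BABIP of about .050 and the other about .700.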

For the record, I think it likely that both Jones and Rendon are, indeed, suffering from unlucky batted-ball outcomes.  I intend to investigate this, and the batted-ball luck of other players, in the coming weeks with the "xwOBABIP" stat outlined in my previous post on ball-in-play outcomes.

Monday, April 25, 2016

Cracking the Ball-In-Play Nut

A bit over a week ago, I pulled the ball-in-play data from the first 10 or so games off of Baseball Savant and started toying with it in R.  In particular, I was looking to confirm some intuitions I (and other people I know who are frustrated with some of the common attitudes towards balls in play in baseball analytics) have about ball trajectory and batted-ball outcome.  The results were surprisingly promising, and after showing them to some friends and family (and having those friends and family bug me about putting this stuff online), I now have a blog and will be sharing them with the yawning chasm of the internet.

For years, balls in play have been a sort of black hole for baseball statistics - a sort of fluke-y quagmire of unexplained variance into which the data simply could not provide all that much meaningful insight.  What little characterization could be done relied much on vague (and inherently subjective) bucketing, classifying hits as "fly balls" or "line drives."  Since little could be done to explain ball-in-play numbers, much of baseball analytics has grown around the notion that they are "unreliable" - pointing to an inflated BABIP when predicting that a player who's had a hot month will regress has become a fairly ubiquitous trope.  In fact, some of the currently-popular metrics (e.g. FIP) simply ignore balls in play entirely!

The rub is, of course, that ball-in-play results are not truly random.  Even FIP's most vehement supporters are forced to, on occasion, accept this fact.  Nor should we, in particular, expect ball-in-play results to be random: that some balls are well-hit and others are poorly-hit is a fact that's self-evident to even the most casual observer, and it is obvious to anyone who has actually watched the sport that Miguel Cabrera tends more towards the former than does Omar Infante.

Unfortunately, the obvious intuition for ball-in-play results ("players who tend to hit the ball well will have good results") might do well for identifying the obvious (we expect Miguel Cabrera to have better ball-in-play results than Omar Infante), but it doesn't go much further than that.  And, until recently, we simply did not have many numbers that could go much further than that, either.

(Of course, this is only true for us mere fans - teams have had access to HitF/X for years, which means that all of the analysis I'm about to outline has probably already been done but is proprietary.  Oh, well.)

Well, that was the case.  Now we have Statcast, and while MLB.com seems interested in little more than using it to quote pitch velocities and home run exit velocities to dubious significance (I'm sure Chris Davis' home run off of Craig Kimbrel was traveling precisely 111.2 mph), a whole wealth of new approaches is open to us.  Let's look at a few of them.  All data are through April 15 - I'd re-do these plots again with more recent data, but I'm lazy and none of this is going to have decent sample size for a long while yet anyway.  This is all purely exploratory, and I don't suggest we draw any conclusions until we have a full season (or several) of numbers to pick through.

First, let's take a look at what we used to be doing.  As mentioned earlier, stats like xBABIP have, historically (if one can use that term for a stat that hasn't existed for all that long to begin with) been based off of crude bucketing of batted balls into subjective categories: "line drives," "ground balls," "fly balls," and "pop-ups."  Until now, we've had no way of knowing how well-defined these categories are, or where they lie in trajectory-space.  Well, wonder no longer:

(Here, and for the rest of this post, I am characterizing "trajectory-space" by vertical launch angle and exit velocity.  We are, obviously, throwing out a whole lot of information by doing this: horizontal angle and spin are the obvious ones, but not-so-obvious ones include wind speed and air temperature and, for batted-ball results later on, any number of things that impact whether or not a ball is caught.  Consider this caveat-ed.)

Percentage of balls hit on a given trajectory that were classified as, from the top, a pop-up, fly ball, line drive, or ground ball.


A few things are immediately apparent (aside from the few wacky outliers, like that "ground ball" at the 50+ degree launch angle - I wonder who goofed on that one, Statcast or the human?).  Firstly, the only one of these categories that doesn't have an extremely fuzzy boundary is ground balls.  That makes perfect sense, as it's a lot easier to tell if a ball hits the ground (at least, to my sensibilities) than to judge whether or not it was "fly ball-ish enough" to qualify as a low fly ball rather than a high liner.  Secondly, following from this, "line drives" are a supremely messy category.  Thirdly, the "fly ball/pop-up" distinction is not purely an angle-off-bat thing - there's clearly a pretty big dependence on exit velocity, too.

While the finer points of this might be interesting from a cognitive science standpoint ("what makes an observer think of a trajectory as a 'pop-up?'"), the main take-away is that this is a pretty crude and inconsistent bucketing of trajectory-space, and as we will see is not really capable of capturing the most important (and immediately-visible) trends of the ball-in-play data.  We now can (and should) do a whole lot better.

So, about those "immediately-visible and important trends."  The obvious thing to do now is to look at actual results of balls in play, and how those results vary over trajectory space.  For want of more clever ideas of how to do this, I simply assigned to each outcome its corresponding wOBA weight and plotted the average value over trajectory space.  The resulting "wOBABIP" (now there's a name for a stat - I don't know if we'll be able to say this one in real life too often without laughing) graph is below:
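The computation behind a plot like this can be sketched in a few lines of Python.  The wOBA weights below are approximate, illustrative values rather than the exact ones used for the plot, the events are made up, and square 5 mph by 5 degree bins stand in for the hex bins.

```python
import math
from collections import defaultdict

# Approximate, illustrative wOBA weights; the real ones vary by season.
WOBA_WEIGHTS = {"out": 0.0, "single": 0.88, "double": 1.25,
                "triple": 1.58, "home_run": 2.03}

BIN_SIZE = 5.0  # mph and degrees per bin; an arbitrary choice


def bin_key(exit_velocity, launch_angle):
    """Square bin containing a point in trajectory-space."""
    return (math.floor(exit_velocity / BIN_SIZE),
            math.floor(launch_angle / BIN_SIZE))


def wobabip_by_bin(events):
    """Mean wOBA weight of the outcomes in each occupied bin."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for exit_velocity, launch_angle, outcome in events:
        key = bin_key(exit_velocity, launch_angle)
        totals[key] += WOBA_WEIGHTS[outcome]
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up (exit_velocity_mph, launch_angle_deg, outcome) events.
events = [(103.0, 28.0, "home_run"), (101.0, 26.0, "out"),
          (88.0, -12.0, "out"), (86.0, -14.0, "single"),
          (72.0, 33.0, "single")]

surface = wobabip_by_bin(events)
print(surface)
```

With these five events, the hard-hit fly ball bin averages about 1.015, the grounder bin 0.44, and the lone bloop 0.88.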


Again, some things are immediately apparent, and in ways that are quite pleasing to the intuitions of anyone who's watched baseball.  That nice, big blob of red corresponds to the home runs, and is more or less where we'd expect it to be.  We see a nice streak of good results along a ~20-degree launch angle, confirming that, as we all knew, Line Drives Are Good.  Interestingly, though, the window for a "good line drive" is not only quite narrow, but has a pretty clear dependence on exit velocity - harder-hit balls need to be lower in order to avoid staying up long enough to be caught.  Conversely, softer-hit balls need to be higher to make it over the infield.  In fact, this trend carries smoothly right into the most satisfying part of all - the nice big blob of "bloop singles:" the wimpy-exit-velocity counterpart of the home run blob, the pitcher's worst nightmare in graphical format.  Here, we see exactly what all of us knew - if you're going to hit a fly ball, you either want to hit it out of the park, or dink it just over the infield, driving opposing pitchers (and fans) mad.  Anything in-between is an easily-fielded fly-out.

Also of note is the relative lack-of-worth of ground balls.  As we might expect, the results on grounders do get better with increased exit velocity (you can see this even more clearly on the following BABIP chart), but the vast majority of grounders are bad.  Want to know which pitchers can sustain a low wOBABIP?  There's an obvious place to start.

For the sake of completeness, here's the same chart, but with BABIP instead of wOBABIP:

I wholly expect (or hope?) that a lot of this noise will clean itself up once we have a larger sample, though for now it can really be entertaining to look at a few of those singles and wonder how they got through.

One last (not-all-that-exciting) plot - here's the raw sample size per bin:


I don't find this one earth-shatteringly insightful, myself, but it does give us a notion of what potential "batted ball profile" plots might look like for individual players.

So, I've spoken mostly about how these figures confirm our intuitions, but what can we do with them?  Well, quite a few things, it seems to me.  Let me list a few things I'd like to dig into in the future (or see someone else dig into):

1)  Development of new "xBABIP" and "xwOBABIP" stats (because "wOBABIP" isn't already funny enough to say).  Take a player's ball-in-play events, assign each a weight based on the league-average BABIP/wOBABIP in some bin around its position in trajectory-space, then average.  Obvious complication: proper sizing/shaping of bins (especially the proper relative scaling of exit angle and exit velocity, which does not appear obvious to me).  I'm fairly sure that even naive, suboptimal binning will do better than the batted-ball classifications seen above, though.  It would be nice to finally ditch the "inflated BABIP -> regression" trope, which I've never found particularly convincing - it seems to me a much stronger argument that a player is bound for a drop in performance if we can show that he's been getting on base in spite of hitting the ball poorly, rather than just noting that his balls in play have fallen in for hits.  Of course, to what extent hitting the ball "well" in this "xwOBABIP" sense is repeatable is also a question, which leads us to...

2)  Investigating the stabilization rates of said stats - or, more fundamentally, the stabilization rates of individual players' batted-ball profiles (as measured in this trajectory-space).  This is all of limited utility unless we can figure out over what sample size we'd expect to see real effects.  This obviously requires the development of some easy way of calculating a "variance" from these plots - it could be as simple as some sort of moving-window average error summed over the whole plot, or perhaps it will need to be more complicated than that.  I don't know!  I guess we'll have to just try stuff and see.  It will also be interesting to see if there are any repeatable time-dependent trends over the course of a season (obviously we'll see more home runs in warm weather, and that should show up), though we'll probably need lots more data for that.

3)  Net team defensive metrics.  There is no way that team defense will not show up on these plots.  Obviously, it will show up confounded by myriad other effects - ballpark effects, weather effects, "luck," anything that could possibly go into whether or not a ball hit at a given vertical angle at a given speed will end up in a player's glove.  But, confounded or not, they should be there - and with larger sample sizes we could probably block our data for a few of the confounding variables.  We could probably even look at infield and outfield defense separately, since we should be able to guess pretty well by a ball's location in trajectory-space whether or not it reached the outfield.  Of course, a huge portion of what shows up will also be shifts and general defensive positioning, but that's as much a part of team defensive value as is the actual physical ability of the players.  It will be interesting to see how defensive metrics formulated this way compare to UZR and the like.

4)  Better investigation of what actually goes into contact-management skills for a pitcher.  I recall an article on FanGraphs a year or so ago that investigated wOBABIP versus pitch location, and found quite clearly that low-and-away and high-and-inside pitches fare far better than anything else.  We could now do similar investigations by looking at the pitch location heatmaps of pitchers who manage a consistently below-average xwOBABIP.

This is an exciting new (well, for fans - again, I'd be surprised if this hasn't all been done already with HitF/X) frontier of baseball analytics, and I can't wait to see what comes of it.  I'll eventually be posting more stuff as I dig through the numbers more.

Sunday, April 24, 2016

First post!

Welcome to all of my (nonexistent) readers.  This is going to be a blog where I talk about baseball.  And baseball statistics.  And math.  And whatever else my brain decides is a good idea to spew into the void of the internet, to bounce around from server to server for the rest of time...

So, a little about me, I guess.  I'm a 23-year-old soon-to-be Ph.D. student with a B.Sc. in mathematics from the University of Maryland.  I enjoy watching (and overthinking) baseball.  I also enjoy doing and teaching math, cooking and eating food, building robots, playing guitar, and a large number of other things.  Some of these will likely be the subject of future posts.

I'm going to cut this off here, since (in case it wasn't clear already) I don't exactly have much to say.  There will be some content, soon!  I promise!