A harrowing tale I have to tell, my friends in data visualization. A tale for only the hardiest of heart, but a tale needs must be told. Look upon my dataflow and despair.
You’d think I’d have learned my lesson back in 2016 but I truly hadn’t. Dear Reader, I am here to tell you, this is one lesson I heartily endorse for all of you. Heed my words. Do not follow me down these dark paths. This way lay madness; and I fell upon it, as it fell upon me. I have barely yet escaped to tell the tale, so through parched and cracked lips, drawing on the faintest breath I have left in me, I beseech you:
Back lo those many years ago, when thinking about politics was something I did for fun and from which I derived pleasure [BAHAHAHAHAHA it’s funny because it’s terrifying], I quite enjoyed the project my colleague Andy and I embarked on. In fact, it was an outgrowth of a project my manager and I started, which was to track the overseas bookmakers’ betting odds on who would win the U.S. presidency in the 2016 election.
Now, I TRULY do mean “many years ago,” since Tom (that’s my manager, then and now) started tracking this data sometime in 2013. He would pull it manually every few weeks or so, until we discovered that a site called Oddschecker was doing daily (even more than daily, eventually) updates that we could pretty easily download. (By download, I mean cut-and-paste.) I maintained one base viz showing the horserace up-to-date, and produced a few others when there were specific stories to be told.
Andy, my data science friend, was much more into the predictive analysis and polling side of things, and had developed an ever-growing disdain for some of the more popular and populist forecasters working the political beat. As such, he wanted to publicize his own predictive models, confident that he could best the models already out there in the world. Happily I offered to help visualize his models, his predictions, and (as best as technology allowed) his iterative process, to be as transparent about what the forecasting was saying and was based upon as we possibly could be.
Mostly, this is a story I’ve told before, several times, so to you oldheads, sorry to be treading the same ground again; but suffice it to say, we got REAL confident in the predictive models, so much so that we went so far as to predict the Senate races as well; we even tried to predict on a minute-by-minute basis not only who would win each state, but when those states would be called for the winner.
OBVIOUSLY this all went as sideways as was humanly possible.
(Well, that’s unfair. We didn’t just throw darts at a board. But being wrong in the margins, on something that operates on as thin of a margin as national elections, means that you’re REAL wrong on aggregate.)
It was in the wreckage of that election that I promised myself that I would step WAY the hell back from the prediction game. Visualizing the future was a terrible idea.
I was tormented by these ideas:
the things I had visualized were expressed extremely well, I thought;
the models were clear, understandable, transparent;
I showed data in a lot of different formats, to allow people to understand and remember them in whatever way spoke to them personally;
I liked the visual appeal;
And yet in the end, they were worthless--worse than worthless, in fact. They provided false hope. They conveyed certainty where certainty can literally never be found, for the future is promised to nobody and guarantees nothing.
And so, I found myself, in the first week of March 2019, awash in spreadsheets, Prep flows, nested calculated fields, and trying once again to visualize the future, on the eve of Major League Baseball's Opening Day.
Because I love the sport, and because I am a fan of Jacob Olsufka’s visualization style, it was a near certainty that I was going to participate in the baseball-themed #SportsVizSunday effort that he was co-hosting. At the time, we were still in the throes of the most underwhelming free-agent offseason in recent memory, waiting on generational superstars to sign mega-deals with new teams but having suitors show surprisingly little interest.
SIDEBAR WHILE I RANT ABOUT THE MEANS OF PRODUCTION IN BASEBALL AND OTHER LABOR STUFF
The new realities of baseball’s not-strictly-collusion economics seem to mean that the analytics departments across the league have all reached identical conclusions: the smart financial choice for a franchise is to field less-experienced players at lower cost rather than commit to major outlays for proven, if potentially soon-to-decline, veterans.
This is only possible because of baseball’s salary structure, which restricts player movement from team to team until that player has acquired six years of Major League service. Before that time, contract values are essentially locked in for the first three years, and determined by an arbitration panel in the subsequent three years. Particularly in years 0-3, talented players end up providing on-field value far in excess of their salary. This is mitigated somewhat in years 4-6, but arbitration itself is an unpleasant prospect--a team has to present its best case AGAINST its own player, arguing how that player is worth as little money as possible--so many players agree to terms without going through arbitration.
All of this is to say, there’s very little free-market economy available to young players. Now, most teams in the league are refusing to pay market-rate salaries to veterans commensurate with their past performance, using the argument that “players peak around age 27,” which in almost every case comes before a player hits free agency, and therefore “we don’t want to pay for a player on the decline.”
So if a player can only be paid what his value is during his ascendant and peak years, but a player is prohibited from negotiating in the free market until AFTER his peak years, when is he not being exploited?
I know, I know--from one perspective this is just millionaires arguing about whether they should be multi-millionaires. From my perspective, though, it is about the right of the labor force to be paid commensurate with the value they provide.
And I suppose you could see this from the business owner’s perspective, that it’s just a shrewd strategy to get above-average performance at a below-average cost. But the current Collective Bargaining Agreement dates from a time when teams did not value young players so highly. The Players’ Association, operating from a position of assuming that its veteran members would be compensated highly, should their past performance and durability allow them to stay active long enough, was willing to concede the early years of prime earning potential to this current system that depresses their salaries. When the CBA expires in two years, it promises to be a huge bone of contention in labor negotiations.
But I digress.
The point is, I wanted to visualize this pay vs. performance disparity for 2019. I found a resource (rosterresource.com at first; baseball-reference.com, and its subsection that was once Cot's Contracts, eventually) that had all of the known salary information for not only 2019, but for every existing MLB contract. So if you were in year 1 of a six-year deal, I knew that, and what the salary would be for each of those years. If it didn’t have salary data for a given year, it showed what stage of the salary progression the player would be in (pre-arbitration; arbitration year 1, 2, 3; or free agency).
I also found a resource (Fangraphs, not exactly a hidden gem) that had 2019 season “projected WAR” for thousands of players. (WAR is “wins above replacement,” a VERY convoluted metric that is dependent on many different things, but ultimately can be understood as “if you have Player X in this specific role on your team for every game of the season, you can expect to win [WAR] more games than if you were forced to play a freely-available barely-major-league-quality player there.”)
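For anyone who wants the arithmetic behind "pay vs. performance" spelled out, here is a minimal Python sketch. It is not the calculation from the viz itself (that lived in Prep flows and calculated fields); the names PlayerSeason and dollars_per_war are hypothetical, every number is made up, and it assumes salary and projected WAR have already been joined per player.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlayerSeason:
    name: str
    salary_usd: float      # known salary for the season (illustrative)
    projected_war: float   # projected wins above replacement (illustrative)


def dollars_per_war(p: PlayerSeason) -> Optional[float]:
    """Salary paid per projected win above replacement.

    Returns None when projected WAR is zero or negative, since
    "dollars per win" is not meaningful for such a player.
    """
    if p.projected_war <= 0:
        return None
    return p.salary_usd / p.projected_war


# Made-up players: a cheap young star, a pricey veteran, a replacement-level bench bat.
players = [
    PlayerSeason("Young Star", 600_000, 5.0),
    PlayerSeason("Proven Veteran", 20_000_000, 2.0),
    PlayerSeason("Bench Bat", 1_000_000, 0.0),
]

for p in players:
    print(p.name, dollars_per_war(p))
```

With these invented numbers the young star costs 120,000 dollars per projected win and the veteran costs 10,000,000, which is the kind of gap the sidebar above is ranting about.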
The hard part, believe it or not, was getting roster information. Now, I was doing this in early March, when teams have not yet set their final rosters.
WAIT, LET ME EXPLAIN WHY A ROSTER IS BOTH APPLES AND ORANGES
During the baseball season, there is a “25-man roster.” Those are the 25 guys who will show up to the ballpark ready to play a game on any given day. (Well, technically, 4 of the 5 designated starting pitchers on a team won’t be ready to play, but they’re on the 25-man active roster.)
There is also a “40-man roster.” For our purposes, this is your 25-man roster plus another 15 guys who you think are likely to be required to contribute to the major league club at some point in the season--when Major Leaguers get injured, or traded, or underperform, replacements and reinforcements are always necessary.
Then, since it’s spring training, you have even more players in training camp. These include “non-roster invitees,” who do not have contracts with the organization at all; and, they include minor leaguers within the organization who are not on the 40-man roster because they are not at that level in their professional development yet...but might end up making the team if they impress during camp.
In early March, it’s VERY DIFFICULT to pin down which players are going to be on the 25-man roster, the 40-man roster, or no roster; also, there are still free agents out there who will sign with one team or another; players will be cut, and might join up with another team; and so on.
Beyond just figuring out which players are going to contribute to the Major League club in the regular season, I was also trying to figure out the depth chart--who the STARTING players were going to be.
WAIT, LET ME EXPLAIN WHY "WHO'S THE STARTER" IS ALSO ANNOYING
In a perfect roster world, the same nine players would start every game. HAHAHAH. This never happens. For one thing, the starting pitcher will only play once out of every five games. That means that you have five "starters" for one position: pitcher. Most teams have a pecking order of starters, from their SP1 (or "Ace") to SP5.
In the American League half of the Majors, there's also a Designated Hitter starter. So half the teams have one, half don't.
Many teams use a "platoon" approach to positions like LF or 1B or DH; they'll start a right-handed hitting player against a left-handed pitcher (it's a natural advantage for a hitter to be opposite-handed to the pitcher), or vice versa. Which player is the starter? (Answer, for my purposes: the one that Fangraphs projected to play in more games.)
A "closer" is a pitcher who is often asked to play in the final inning of a game in which his team is leading, because he "closes out" the game. He's considered to be the best non-starting pitcher on the team, has a specialized role, and will probably appear in 30-40% of his team's games. He plays in high-stakes situations, but not every day; is he a "starter" in the same way that, like, the SP5 is? I decided that he was, but that no other relief pitcher was.
What if there's an obvious starter--Luis Severino, for instance, a pitcher for the Yankees--who happens to be currently injured? Well, I counted that player as part of the Bench/Bullpen/Injured/Suspended/Fringe Crew that might get called upon eventually. What if that injury is super mild and he's expected back for the second week of the season? Well, then I'm hosed, because I'm not listing him as the starter and this viz is going to live with him on the bench for the foreseeable future.
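The bright-line rules above can be written down as a small Python sketch. This is an illustration under assumptions, not the logic from the actual viz: the Player fields, the role codes ("SP", "CL", "RP"), and the function name pick_starters are all hypothetical, and ranking starting pitchers by projected games to fill five rotation slots is my guess at a tiebreaker, not something the rules above specify.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    team: str
    role: str             # hypothetical codes: "SP", "CL", "RP", "LF", "1B", "DH", ...
    projected_games: int  # projected games played for the season
    injured: bool = False


def pick_starters(players, rotation_size=5):
    """Return {(team, role): [starter names]} using the rules described above."""
    by_slot = defaultdict(list)
    for p in players:
        # Injured players go to the bench group; so do relievers other than the closer.
        if p.injured or p.role == "RP":
            continue
        by_slot[(p.team, p.role)].append(p)

    starters = {}
    for (team, role), candidates in by_slot.items():
        # Platoon rule: whoever is projected to play more games is the starter.
        candidates.sort(key=lambda p: p.projected_games, reverse=True)
        keep = rotation_size if role == "SP" else 1
        starters[(team, role)] = [p.name for p in candidates[:keep]]
    return starters


# Made-up roster fragment.
roster = [
    Player("Lefty Masher", "NYY", "LF", 110),
    Player("Righty Masher", "NYY", "LF", 70),
    Player("The Closer", "NYY", "CL", 65),
    Player("Setup Man", "NYY", "RP", 70),
    Player("Injured Ace", "NYY", "SP", 30, injured=True),
    Player("Healthy Arm", "NYY", "SP", 28),
]
print(pick_starters(roster))
```

Note that the injured ace drops out entirely, which is exactly the "well, then I'm hosed" problem: the rule is simple, but it bakes in whatever was true on the day the data was pulled.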
Well, I tried a few different sources for this data, and I also tried a few different bright-line tests for how involved a player had to be in order to be included in this viz. On top of that, I was going to have to disambiguate all of these different data sets: Fangraphs has unique player IDs but the other sites I used did not; some used nicknames, some didn’t (Pete or Peter? Nathan or Nate? etc.); some included “Jr.” or “II” when that was part of a player name, and some didn’t; some included accents and tildes for names of Latino players, some didn’t; they all had wildly varying ideas on how to capitalize and hyphenate Asian names.
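If you ever face the same name-matching mess, one common approach is to reduce every name to a normalized key before joining. The Python sketch below is an assumption-laden illustration, not what I actually ran: normalize_name, the SUFFIXES set, and the tiny NICKNAMES table are all hypothetical, and the example names are invented.

```python
import re
import unicodedata

# Hypothetical lookup tables; a real project would need far longer lists.
SUFFIXES = {"jr", "sr", "ii", "iii", "iv"}
NICKNAMES = {"pete": "peter", "nate": "nathan"}


def normalize_name(raw: str) -> str:
    """Reduce a player name to a lowercase, accent-free, suffix-free join key."""
    # Decompose accented characters, then drop the combining marks (é -> e, ñ -> n).
    decomposed = unicodedata.normalize("NFKD", raw)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    # Lowercase, treat hyphens as spaces, and drop periods and apostrophes.
    cleaned = plain.lower().replace("-", " ")
    cleaned = re.sub(r"[.'’]", "", cleaned)

    # Drop generational suffixes such as "Jr." or "II".
    tokens = [t for t in cleaned.split() if t not in SUFFIXES]

    # Expand a known nickname in the first-name position only.
    if tokens:
        tokens[0] = NICKNAMES.get(tokens[0], tokens[0])
    return " ".join(tokens)


print(normalize_name("Pete Muñoz Jr."))
print(normalize_name("Min-Ho Kim") == normalize_name("min ho KIM"))
```

A key like this is best used to flag candidate matches for a human to confirm. It will happily merge two different players who share a name, and it will rename a player whose given name really is Pete.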
Even worse news: most of this data was not downloadable; only scrape-able. In some cases, it was only cut-and-paste-able. And, of course, none of the tables were clean, none of them used the same names or abbreviations for teams, and so on and so forth. Look, I know this is a common data wrangling problem. I get it.
But the REAL problem is: the data changed every. Single. Day.
Why did it change every day? Because we’re trying to visualize the future! We’re not talking about last year, when we know who started and who played partial seasons; who made the roster and who didn’t; who was traded and what everyone’s performance stats were.
I, to my overwhelming and rueful annoyance, had chosen to pursue the creation of a viz that leveraged disparate, ever-changing data-sources, and would require me to guess at things that were eventually going to be certain...but only after I published the viz. Moreover, the whole premise of the viz was about PREDICTED on-field value. By the end of the season, the story will certainly be completely different. Just as in the election vizzes, time could very easily render this visualization not only moot, but absurd.
So I say this to everyone who read this far. I’m glad I did this viz. I’m happy with how it came out, technically and visually. I learned a few interesting things along the way and it got me in the mood for baseball season...not that I needed a lot of help in that department.
But for the love of all things you hold dear, learn from my pain.
VISUALIZE THE PAST ONLY!
Because visualizing the future in the present has every chance of becoming the embarrassment of your future past.