Jul 23, 2024

UPDATED: Player Ranking Methodology

Numbers! Data! Applied math! If I’m really lucky, maybe I’ll even get to put together some graphs and build out interconnected dashboards. I enjoy looking at baseball statistics and I really like baseball cards. Any opportunity to bring the two together is time well spent.

One of the first things I did after deciding to write about my collection was to come up with a scalable method of ranking the performance of players within a given baseball card checklist. Most collectors probably have some sort of ranking of top players within easy mental reach, but for me the real fun begins when asking fellow collectors their nominations for 10th, 100th, and even 1,000th best player in their collection and how they came to that conclusion. Someone who can answer that query has really thought out the cards in their possession and you can get some insight into the mental models they employ in the process.

Of course, a one-way pop quiz that requires showing your work doesn’t make for much of a conversation. If I were to ask someone to go that in depth about their rankings, I would need to be able to uphold my end of the discussion as well. The result of my initial ranking system was separate hierarchies for position players and pitchers. Both relied heavily on wins above replacement (WAR) as their foundation, with modifications applied to overcome the cumulative nature of the metric and to better reflect my personal leanings when it comes to the aspects of the game I find most enjoyable. The result wasn’t a ranking of the best players, but rather an ordering of players who most resembled my personal ideal of a ballplayer.

In general this system worked well, but it had its shortcomings. Position players and pitchers were ranked separately and could not be valued head to head. Relief pitchers were valued much too highly. I frequently found myself mulling more effective scoring models that would require additional data gathering to fully flesh out. Finally, because the underlying stats for my models were assembled years ago, the system failed to reflect pre-integration play outside of the National and American Leagues (e.g. Satchel Paige played more than just his six partial AL seasons).

I recently undertook a redesign of the player infographics used here at CardBoredom and decided the time was right to address the reservations I had about the old system. Here’s what comprises the revised model:

Position Players

The original framework used in my ranking was based on a mix of cumulative and rate stats, giving a two-thirds weight to a player’s cumulative WAR and a one-third weight to weighted on-base average (wOBA). Each component was scaled against a maximum score of 100 points, so that someone who led both categories (i.e. Babe Ruth) would score 100 and those of lesser rank would see a lower value assigned.

This approach was nice and simple. WAR is widely understood by baseball fans and provides a pretty decent overview of how a player stacks up against his peers. The metric accounts for all aspects of a player’s performance, combining hitting with contributions on the basepaths and in the field. It is cumulative, which allows long careers to be appreciated but can produce misleading rankings when comparing steady “compilers” with high-peak/quick drop-off guys.

wOBA is a purely offensive measure and is a major contributor to my preferred method of calculating WAR. Why would I want to essentially double count its contribution by considering it in addition to WAR? I wanted the resulting player value to overstate my favorite aspect of the game (ability to mash the ball) while understating the impact of positional scarcity. WAR includes adjustments that boost the value of hard-to-fill positions while devaluing the contributions from the likes of designated hitters and corner outfielders. Phooey. Give me Ken Griffey, Jr. (77.7 WAR) over Brooks Robinson (80.5 WAR) any day.

The advantage of using wOBA over other rate-based offensive metrics is its ability to depict batting skill while avoiding the cumulative nature of WAR. Essentially a superior form of batting average, wOBA captures everything that batting average’s successors, slugging percentage and on-base percentage, wish they could. It takes every outcome experienced by a hitter and weights it by the game’s historical average of runs scored as a result of that same outcome. Walking and getting hit by pitches contribute to a team’s offense. A single generates more offense than a walk, as baserunners have the opportunity to advance by more than one base. Doubles, triples, and home runs likewise generate even larger gains.
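To make the weighting idea concrete, here is a minimal sketch of a wOBA-style calculation. The linear weights below are illustrative assumptions in the general range of published values, not the exact figures behind the numbers in this post; real weights are recalculated for every season. The function name and the sample stat line are hypothetical.

```python
# Illustrative linear weights (assumed values; real weights change each season).
WEIGHTS = {
    "ubb": 0.69,     # unintentional walks
    "hbp": 0.72,     # hit by pitch
    "single": 0.89,
    "double": 1.27,
    "triple": 1.62,
    "hr": 2.10,
}


def woba(ubb, hbp, single, double, triple, hr, ab, sf):
    """Weight each offensive outcome by its run value, then divide by opportunities."""
    numerator = (
        WEIGHTS["ubb"] * ubb
        + WEIGHTS["hbp"] * hbp
        + WEIGHTS["single"] * single
        + WEIGHTS["double"] * double
        + WEIGHTS["triple"] * triple
        + WEIGHTS["hr"] * hr
    )
    # Opportunities: at-bats, unintentional walks, sacrifice flies, hit by pitch.
    denominator = ab + ubb + sf + hbp
    return numerator / denominator


# Hypothetical season line, made up purely for illustration.
sample = woba(ubb=50, hbp=5, single=100, double=30, triple=3, hr=20, ab=500, sf=5)
print(f"{sample:.3f}")  # roughly 0.379
```

Note how a walk and a single both count, but the single counts for more, and extra-base hits count for more still.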

This ranking system worked well most of the time, but the wOBA component needed a tweak to really capture just how close or far apart two players could be in the resulting standings. Specifically, the original model called for a player’s wOBA to be compared to that of Babe Ruth (.513), scaled out of 100, and counted as one-third of his overall composite score. A perfectly average batter with a wOBA of .320 and zero career WAR would therefore generate a composite score of 20.8 points out of a possible 100.
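The original scoring can be sketched in a few lines. This is a reading of the description above rather than the actual spreadsheet: the function name is hypothetical, and the 167.0 maximum for WAR is taken from Babe Ruth’s line in the position player table in this post.

```python
def original_composite(war, woba, max_war=167.0, max_woba=0.513):
    """Original model: two-thirds WAR, one-third wOBA, each scaled to the record."""
    war_part = (2 / 3) * 100 * war / max_war
    woba_part = (1 / 3) * 100 * woba / max_woba
    return war_part + woba_part


# A perfectly average hitter (.320 wOBA) with zero career WAR.
print(round(original_composite(0.0, 0.320), 1))    # 20.8

# A player leading both categories scores the full 100 points.
print(round(original_composite(167.0, 0.513), 1))  # 100.0
```

The first result shows the quirk being addressed: an average bat with no career value still collects more than 20 points.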

The change I made in this iteration of performance rankings was to begin measuring the difference between a player and record career wOBA rather than just looking at nominal wOBA versus the record. After all, why give a player credit for below average performance? Why not lower the resulting composite score for degraded batting skill? Average wOBA (.320) was scaled so that this would produce a contribution of 0.0 to the composite with above average wOBAs adding to the score and below average readings subtracting from it.
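One plausible way to express this adjustment is sketched below. It assumes the wOBA component is the gap between a player and the .320 average, divided by the gap between the career record and that same average, and it uses Josh Gibson’s .521 (cited later in this post) as the record. The function names and the exact scaling are assumptions, so the output should not be expected to match the published scores to the decimal.

```python
def woba_component(woba, avg_woba=0.320, record_woba=0.521, weight=100 / 3):
    """Zero points at league average, full weight at the record, negative below average."""
    return weight * (woba - avg_woba) / (record_woba - avg_woba)


def revised_composite(war, woba, max_war=167.0):
    """Two-thirds cumulative WAR plus the re-centered wOBA component."""
    war_part = (2 / 3) * 100 * war / max_war
    return war_part + woba_component(woba)


print(round(woba_component(0.320), 1))  # 0.0 for an average bat
print(round(woba_component(0.300), 1))  # negative for a below-average bat
print(round(revised_composite(167.0, 0.513), 1))
```

Under these assumptions Babe Ruth’s line works out to about 98.7, close to but not the same as the 96.9 published here, which suggests the real model differs in some detail not spelled out in the text.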

No such adjustment was needed for WAR, which already incorporates negative values to represent performance that hurts the ability to win ballgames.

After tallying up composite scores from each player’s WAR and wOBA figures I was left with the all-time Top 10 table below. Babe Ruth’s perfect 100.0 score fell a bit from my earlier figures as his .513 career record was replaced in the standings by Josh Gibson’s eye-popping .521 career measure.

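| Rank | Player | WAR | wOBA | Composite Score |
|---|---|---|---|---|
| 1 | Babe Ruth | 167.0 | .513 | 96.9 |
| 2 | Barry Bonds | 164.4 | .435 | 83.6 |
| 3 | Ty Cobb | 149.1 | .445 | 79.1 |
| 4 | Ted Williams | 129.8 | .494 | 79.0 |
| 5 | Willie Mays | 149.8 | .409 | 73.7 |
| 6 | Rogers Hornsby | 129.1 | .459 | 73.4 |
| 7 | Lou Gehrig | 115.9 | .457 | 70.9 |
| 8 | Tris Speaker | 130.2 | .436 | 70.1 |
| 9 | Honus Wagner | 138.1 | .408 | 68.9 |
| 10 | Stan Musial | 126.4 | .435 | 68.5 |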

Could this have been taken further to incorporate ballpark effects and other factors that affect comparability across different eras? Of course. Do I want to add them into my model? Not really. Time periods with excessively high wOBA readings are more exciting in my eyes, making a model that systematically enhances players who excel in this area something that I look forward to.

Pitchers

A more pronounced adjustment was needed in my approach to measuring pitching effectiveness. I previously used a combination of a cumulative stat and a rate metric to rank pitchers, assigning a 2/3 weight to WAR and 1/3 to FIP- (adjusted fielding independent pitching). While this worked well among the sport’s top 50 or so pitchers, the ranking quickly became overrun with relief pitchers who sported absurdly good FIP- measures. Going on the assumption that a reliever is most likely a failed starting pitcher, this result probably does not produce a very accurate overall ranking of pitchers.

WAR and FIP- remain my favorite metrics for evaluating pitchers. WAR presents readers with a single cumulative lifetime measure of effectiveness. Lengthy careers are a positive, provided the pitcher was skilled enough to generate positive WAR values over an extended period. Each pitcher’s WAR was compared against Babe Ruth’s MLB record mark of 167.1.

The now familiar cumulative WAR measure was scaled back from 67% of a player’s composite score to 50% to make room for another view of WAR in the model. One of the issues I previously struggled with was assigning a relative value to a pitcher who plays in only a fraction of a team’s games against a backdrop of everyday position players. In addition, relief pitchers were overrunning the mid-levels of my player rankings. The solution was to create something akin to the WAR per 162 games figure frequently reported for position players. While the frequency of starts has shifted over the years, I assumed a traditional four-man pitching rotation that generates 40 appearances per season. This results in an annualized WAR production figure that is roughly comparable to that of any position player and has the added benefit of scaling back the impact of relief pitchers. This annualized WAR is given a 25% weight in the overall composite and compared against Babe Ruth’s record annual production of 10.8 per season.
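The annualizing step can be sketched as career WAR divided by career appearances, multiplied by 40. That is an assumption about how the figure is built; the function name and both sample pitchers below are hypothetical.

```python
def war_per_40(career_war, appearances, season_appearances=40):
    """Scale career WAR to a notional 40-appearance season."""
    if appearances <= 0:
        raise ValueError("appearances must be positive")
    return career_war / appearances * season_appearances


# Hypothetical starter: 100 WAR over 600 career appearances.
print(round(war_per_40(100.0, 600), 1))   # 6.7

# Hypothetical reliever: 25 WAR over 1,000 career appearances.
print(round(war_per_40(25.0, 1000), 1))   # 1.0
```

Because relievers rack up appearances far faster than they rack up WAR, dividing by appearances pulls their annualized figure well below that of a comparable starter.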

The calculation underpinning my preferred measure of WAR is heavily influenced by fielding independent pitching, an estimate of what a pitcher’s ERA would be if calculated solely on outcomes that need no help from other defensive players, largely in the form of strikeouts, walks, and home runs allowed. Pitching has undergone several large-scale changes over the past 100+ years, transitioning from a period in which doctored balls were expected and home runs were rare to a time that saw roided-up beasts swinging at triple-digit pitches that didn’t even exist a generation earlier. A 3.00 ERA means vastly different things depending on the decade in which it was generated. FIP- takes a hypothetical ERA that has been stripped of outside influence and makes appropriate adjustments for the era in which it was generated. The result is a scaled measure in which a pitcher of perfectly average skill displays a score of 100. As an approximation of ERA, lower scores indicate better performance and readings above 100 indicate below average capability.
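For readers who want to see the moving parts, here is a simplified sketch of FIP and a FIP- style index. The FIP formula is the commonly published one; the 3.10 constant is an assumed placeholder (the real constant is set each season so that league FIP lines up with league ERA), and this version of FIP- skips the park adjustment that the full metric applies. The function names, the sample stat line, and the 4.00 league figure are hypothetical.

```python
def fip(hr, bb, hbp, k, ip, constant=3.10):
    """Fielding independent pitching from home runs, walks, hit batters, and strikeouts."""
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant


def fip_minus(pitcher_fip, league_fip):
    """Simplified index: 100 is league average, lower is better (no park adjustment)."""
    return 100 * pitcher_fip / league_fip


# Hypothetical season: 20 HR, 50 BB, 5 HBP, 200 K over 200 innings.
season_fip = fip(hr=20, bb=50, hbp=5, k=200, ip=200)
print(round(season_fip, 3))                   # 3.225
print(round(fip_minus(season_fip, 4.00), 1))  # 80.6 against a 4.00 league FIP
```

The same 3.225 FIP would index differently in a low-scoring era than in a high-scoring one, which is the era adjustment described above.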

Originally given a one-third weight in the scoring model, FIP- was somewhat scaled back in importance given the need to fit WAR/40 into the system. A weighting of 25% was used. Similar to the tweak made to wOBA for position players, I adjusted the formula to measure the difference between a pitcher’s FIP- and 100 against the difference between 100 and a reading of 64, the best career reading for any starting pitcher in history (Jacob deGrom and Jose Fernandez). Readings above 100 will result in a net negative contribution to overall composite scoring.
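Putting the three pitching components together might look like the sketch below. The weights and benchmarks (167.1 WAR, 10.8 WAR per season, FIP- of 64) come from the text above; the function name and the exact arithmetic are assumptions, so treat it as an approximation of the model rather than the model itself.

```python
def pitcher_composite(war, war_per_40, fip_minus,
                      max_war=167.1, max_war_per_40=10.8, best_fip_minus=64.0):
    """50% cumulative WAR, 25% annualized WAR, 25% FIP- measured against average (100)."""
    war_part = 50 * war / max_war
    rate_part = 25 * war_per_40 / max_war_per_40
    # Positive below 100 (better than average), negative above 100.
    fip_part = 25 * (100 - fip_minus) / (100 - best_fip_minus)
    return war_part + rate_part + fip_part


# Roger Clemens' inputs as listed in this post's pitcher table.
print(round(pitcher_composite(133.7, 7.5, 71.2), 1))
```

With Clemens’ listed inputs this lands at roughly 77.4, near the published 78.0 but not identical, presumably because the table inputs are rounded or the real scaling differs in some detail.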

Using the above inputs a newly constituted list of all-time Top 10 pitchers was assembled in the table below. Roger Clemens continues to rank as the best player to regularly take the mound and does so with a composite score high enough to crack the sport’s Top 5 from any position. The extent to which some of these players outclassed their contemporaries is amazing.

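| Rank | Player | WAR | WAR/40 Games | FIP- | Composite Score |
|---|---|---|---|---|---|
| 1 | Roger Clemens | 133.7 | 7.5 | 71.2 | 78.0 |
| 2 | Randy Johnson | 110.5 | 7.1 | 73.2 | 68.8 |
| 3 | Cy Young | 131.5 | 5.8 | 80.0 | 67.5 |
| 4 | Greg Maddux | 116.7 | 6.3 | 77.8 | 65.7 |
| 5 | Walter Johnson | 116.4 | 5.8 | 75.6 | 65.1 |
| 6 | Pedro Martinez | 84.4 | 7.1 | 67.8 | 63.4 |
| 7 | Clayton Kershaw | 75.8 | 7.1 | 69.6 | 60.1 |
| 8 | Bert Blyleven | 102.9 | 5.9 | 82.0 | 58.5 |
| 9 | Nolan Ryan | 106.7 | 5.3 | 83.3 | 56.9 |
| 10 | Christy Mathewson | 90.0 | 5.7 | 76.0 | 56.6 |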

Combining Everything Into a Coherent Ranking

The models above produce scores with values going as high as 100 points. They are scaled so that position players and pitchers can be evaluated side by side, finally allowing my previously separate rankings to be combined. I would like to see a bit more data from the Negro Leagues, though published comments from many at the forefront of this data gathering effort show the great majority of these games have been entered into the record books. Some sort of adjustment for artificially short careers (e.g. military service, leagues with short seasons, labor actions) could be interesting, though the heavy weighting of rate-based stats likely already accounts for this at some level.
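Since both models land on the same 100-point scale, merging the lists is just a sort. The sketch below uses a handful of the published composite scores from the two Top 10 tables in this post; the variable names are hypothetical and the list is deliberately partial.

```python
# Published composite scores from the position player and pitcher tables.
position_players = [
    ("Babe Ruth", 96.9),
    ("Barry Bonds", 83.6),
    ("Ty Cobb", 79.1),
    ("Ted Williams", 79.0),
    ("Willie Mays", 73.7),
]
pitchers = [
    ("Roger Clemens", 78.0),
    ("Randy Johnson", 68.8),
    ("Cy Young", 67.5),
]

# One list, highest composite score first.
combined = sorted(position_players + pitchers, key=lambda row: row[1], reverse=True)

for rank, (name, score) in enumerate(combined, start=1):
    print(f"{rank:>2}  {name:<15} {score:.1f}")
```

Run against these scores, Roger Clemens slots in fifth overall, directly behind Ted Williams.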

I’m happy with where this system now stands and have implemented the revised models across each of the card profiles posted to CardBoredom. Returning to the questions posed at the beginning of this post, I can easily respond to anyone who asks about player rankings beyond the very top of the game. I have Babe Ruth as the sport’s greatest player in terms of my preferred metrics. Stan Musial ranks 10th, Jim Thome 100th, and Tim Belcher rounds out the top 1,000. Who do you have at those positions?