Thinking Out Loud About Statistics

I have a theory that the majority of what I say will eventually make me cringe, usually in about two years or thereabouts, and thanks to the internet this theory is more supportable than ever. Today I embark on a new line of utterances that are sure to shock my future self... because you asked for it.

I am a pretty big baseball fan, and one of my favorite developments of the last 20 years has been the new breed of statistical analysis. While the anti-stats people have a right to complain that statistics aren't everything, there is quite a lot to learn from them in baseball. For example, people traditionally relied on batting average and homers to tell them who the best hitter was, but starting with Bill James, analysts began looking into why these numbers didn't always correlate with wins. Turns out the first job of a hitter is to avoid making an out, and so on-base percentage overtook batting average as the most basic measure. But not all non-outs are equal, which is where slugging percentage comes in. And a slew of stats, starting with OPS and developing into a whole family of comparative measures, took over the narrative.

Cycling doesn't have many stats. Wins are nice, but when Andre Greipel ranks third overall on that count, you can see the limits of the raw number. We award points for placings in races, but I wonder how meaningful this is. IMHO, the CQ Rankings are a fine starting point for understanding who, over the course of the season, is steadily producing results. But point totals are heavily influenced by how much you race.

Take the case of Damiano Cunego vs. Davide Rebellin. Rebellin ranks third overall, and I certainly won't deny him his due for a fine season. But is he really better than Cunego, who ranks fifth? Points-per-start can separate quality from quantity, but even that gets watered down by grand tours (Cunego wound up with 11 more race days). Head-to-head stats are nice, and CQ's calculator is a fun place to mess around, but they don't account for races that one or the other missed, like the Rebellin-free Giro di Lombardia. Which rider is better? I have no idea.
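To make the points-per-start problem concrete, here is a rough sketch in Python. Everything in it is my own assumption for illustration: the `SeasonLine` class, the two anonymous riders, and especially the numbers, which are invented placeholders and not real CQ data. The idea is simply that a stage race counts as one start but many race days, so the two rate stats can disagree about who had the better season.

```python
from dataclasses import dataclass


@dataclass
class SeasonLine:
    """One rider's season totals (hypothetical structure, not a CQ format)."""
    rider: str
    points: float
    starts: int      # races entered; a three-week tour counts as ONE start
    race_days: int   # days actually raced; a three-week tour counts as ~21

    def points_per_start(self) -> float:
        return self.points / self.starts

    def points_per_race_day(self) -> float:
        return self.points / self.race_days


# Invented placeholder figures, NOT real rankings data.
rider_a = SeasonLine("Rider A", points=1500, starts=25, race_days=60)
rider_b = SeasonLine("Rider B", points=1400, starts=20, race_days=80)

for r in (rider_a, rider_b):
    print(f"{r.rider}: {r.points_per_start():.1f} per start, "
          f"{r.points_per_race_day():.1f} per race day")
```

With these made-up figures, Rider B looks better per start while Rider A looks better per race day, which is exactly the kind of ambiguity that makes neither number a satisfying answer on its own.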

I want to create statistics that isolate certain skills. I want to know who the best rider for the hilly classics is, so I want to find a way to award credit in those races that fully recognizes the relative value of a guy versus his competition, with enough data to have some integrity. I want to know who the best bunch sprinter is, the best climber, the best chronoman, etc. Therefore, my winter project will consist of playing with numbers to see what we can learn. I'll probably have a couple of deputies in this effort, but like everything else here, the subject is open to anyone with something to contribute. Not being a statistician, I'll be very open to help, corrections, challenges, etc. And I've added a "Numbers" section to capture these posts.

Obviously this is all experimental in nature. Getting enough data to pass the statistician's laugh test will be hard, and coming up with ways to deploy the data will be a creative venture, to put it kindly. So, take this conversation for what it is: an attempt to shed some light on our favorite subjects, not the final word on them. Stay tuned.