This debate reminds me of Pythagorean expectation, which Bill James used in baseball. If I remember correctly, it does a better job of predicting a team’s future wins and losses than the team’s previous season’s win/loss record does. It states:
Pythagorean expectation is a sports analytics formula devised by Bill James to estimate how many games a baseball team “should” have won based on the number of runs they scored and allowed. Comparing a team’s actual and Pythagorean winning percentage can be used to evaluate how lucky that team was (by examining the variation between the two winning percentages). The name comes from the formula’s resemblance to the Pythagorean theorem.
I think you could reasonably substitute game wins and losses in a set for the number of runs scored and allowed. I am fine with either implementation of Elo, and I think both sides of the argument have merit.
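To make the substitution concrete, here is a minimal sketch of the formula in Python. The exponent of 2 is the original Bill James form; the function name and the sample numbers are made up for illustration, and whether an exponent of 2 suits Yomi game wins is an open question, not something the data here confirms.

```python
def pythagorean_expectation(scored: float, allowed: float, exponent: float = 2.0) -> float:
    """Expected winning fraction from points scored and points allowed."""
    if scored == 0 and allowed == 0:
        raise ValueError("need at least one point scored or allowed")
    s = scored ** exponent
    a = allowed ** exponent
    return s / (s + a)


# Baseball usage: runs scored and runs allowed (illustrative numbers).
baseball = pythagorean_expectation(800, 700)

# Hypothetical Yomi usage: game wins and game losses summed over a player's sets.
yomi = pythagorean_expectation(30, 20)
```

Comparing `yomi` with the player's actual set-win fraction would give the same "how lucky were they" reading that the baseball version gives for a team.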
The sheet is on the fritz right now (I need to do my caching fix to get under the execution time limit), but it turns out that when the prediction is done correctly, based on the Elo before the match, the expected results for the set-based Elo line up almost perfectly with the actual results.
The problem with the baseball argument is that it’s keeping track of relative score throughout a single game rather than a series. That would be the equivalent of tracking percentage of life remaining each game, rather than game wins. Obviously that would be heavily skewed because of certain characters using life as a resource, healing effects or characters that are willing to eat hits as part of their gameplan, etc. which makes this stat unreliable as a metric… maybe? It would be a pain to find those numbers, but it would be interesting to see if there was an overall correlation despite the skew.
Game wins might still be fine as a way to track, as long as Elo adjustments are made between games rather than only between matches. If a player who wins the first round has less to gain and more to lose in the second, this means that a player who is too predictable and doesn’t adapt will see that reflected in their Elo, more than a player who lost the second game of the three. Theoretically, someone who consistently goes W/L/L will have an Elo which reflects their overall poor adaptation skill in a way that simulates the same effect from match-win tracking, with the benefit of also having some reward for players that perform well even in lost matches. There might be something I’m missing, but I believe that’s how things should work out.
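Here is a rough sketch of that idea, assuming the standard Elo formulas (logistic expected score on a 400-point scale, update of K times the difference between result and expectation). The K of 32 and the starting rating of 1500 are placeholder assumptions, not the sheet's actual settings, and the function names are invented for this example.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo win expectation for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Apply one Elo update and return the new (rating_a, rating_b)."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate_per_game(rating_a, rating_b, game_results, k=32.0):
    """Adjust ratings after every game in the set."""
    for a_won in game_results:
        rating_a, rating_b = update(rating_a, rating_b, a_won, k)
    return rating_a, rating_b


def rate_per_set(rating_a, rating_b, game_results, k=32.0):
    """Adjust ratings once, based only on who won the set."""
    a_wins = sum(game_results)
    a_won_set = a_wins > len(game_results) - a_wins
    return update(rating_a, rating_b, a_won_set, k)


# Player A goes W/L/L against an equally rated opponent.
wll = [True, False, False]
per_game = rate_per_game(1500.0, 1500.0, wll)
per_set = rate_per_set(1500.0, 1500.0, wll)
```

In the per-game version, A is the slight favourite after winning game one, so the loss in game two costs a bit more than the first win gained. That is the "less to gain and more to lose" effect described above.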
Game wins are the score during a set. Runs are the score during a baseball game. If you wanted to get increasingly granular, this goes both ways. We could track hits in baseball and combat wins in Yomi. I think runs are a pretty close equivalent to game wins. They are both points that lead to the win.
Edit: I will admit that the game vs. series contradiction is true. I tend to look at a set of Yomi as a close equivalent to a baseball game. One Yomi game is closer to an inning in baseball than an entire game. This all depends upon your perspective.
Here is an approximation of why I think this way. One baseball game tends to have over 200 pitches (combats). One set of Yomi might have 75 combats. I think any single Yomi game just doesn’t contain as much data as any single baseball game.
If a baseball game has 200 pitches, then each pitch is worth 1/200 of the decision in a game. If each Yomi game has 75 reveals, each reveal is worth 1/75 of that game’s result. The fewer the events, the more precious they are and the more each opportunity to score counts. No matter how many events there are, they always add up to 1. Thus, having fewer opportunities to score does not mean scoring is less significant. In fact, it has the opposite effect: the pressure is on for each player to do more with fewer opportunities. The length of a game does not correlate with the value of a win or loss.
It’s true under normal circumstances that the more data there is to be collected, the better. However, the fewer decisions you get to make in a game, the more those individual choices matter toward the outcome, which puts the accuracy of that rule into question. It is legitimate to point out that there’s not as much data in a single game of Yomi, but there’s an equally legitimate flipside to that argument.
I wouldn’t say this is an easy call either way, to be sure.
After fixing up the sheet again, and using the pre-match Elo rather than the post-match Elo, by eyeball, the set-based Elo is a better predictor of set wins than the game-based Elo is of game wins, so I’m going to leave the rest of the sheet in terms of set wins.
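For anyone who wants a number instead of an eyeball comparison, one option is a Brier score (mean squared error between predicted win probability and the actual result). This is a swapped-in technique, not what the sheet does. The sketch below assumes standard Elo, a placeholder K of 32, a starting rating of 1500, and a hypothetical `matches` list of `(player_a, player_b, a_won)` tuples in chronological order. The key point it illustrates is predicting from the rating before the match and only then updating.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo win expectation for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def brier_score(matches, k: float = 32.0, start: float = 1500.0) -> float:
    """Mean squared prediction error using pre-match ratings. Lower is better."""
    matches = list(matches)
    if not matches:
        raise ValueError("need at least one match")
    ratings = defaultdict(lambda: start)
    total = 0.0
    for a, b, a_won in matches:
        # Predict first, from the ratings as they stood BEFORE this match.
        p = expected_score(ratings[a], ratings[b])
        outcome = 1.0 if a_won else 0.0
        total += (p - outcome) ** 2
        # Only then apply the update.
        delta = k * (outcome - p)
        ratings[a] += delta
        ratings[b] -= delta
    return total / len(matches)
```

Running it once on a list of set results and once on a list of game results would give two scores to compare, with 0.25 being what you get from always predicting 50/50.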
According to that chart it looks like there was an 800-point Elo upset, but it’s really difficult for me to dig through the data to find it. Is there an easy way to track it down?
I think one argument for using game wins is that it’s a consistent measure. Set wins vary in terms of predictive power depending on how many games are played in the set.
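A quick sketch of why set length matters, under the simplifying assumption that every game in a set is independent and the stronger player has the same win probability `p_game` each game (real sets with counterpicks will not behave exactly like this). The function name is made up for the example.

```python
from math import comb


def set_win_probability(p_game: float, games_to_win: int) -> float:
    """Chance of winning a first-to-`games_to_win` set given a fixed per-game win chance."""
    q = 1.0 - p_game
    # The set winner takes the final game; their losses can fall anywhere before it.
    return sum(
        comb(games_to_win - 1 + losses, losses) * p_game ** games_to_win * q ** losses
        for losses in range(games_to_win)
    )


best_of_3 = set_win_probability(0.6, 2)
best_of_5 = set_win_probability(0.6, 3)
```

With a 60% per-game edge, the best-of-5 number comes out higher than the best-of-3 number, so the same skill gap produces a different set-win rate depending on the format.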
@vengefulpickle, the elo spreadsheet is not updating. The latest match results data is not current. It appears as if the latest match results are from 1/25. I would appreciate it if you could look into this. Thanks.
I think the problem might be that someone with write permissions has to look at it in some cases. I’ll open it with an anonymous browser to try and reproduce the problem.
@vengefulpickle, something has definitely changed with the Elo spreadsheet. cpat jumped around 300 points to move way above the rest of the pack. Also, the ordering of the player rankings has definitely shifted.
Oh, yeah. I recently ran some numbers to figure out what Elo parameters were the most effective at predicting the matches in the historical record. The result was a longer warmup period where Elo changes faster (and a slightly higher rate of change during the longer period).
Hi @vengefulpickle, I’m working on Elo rating for Codex players and was wondering which K value you are using for the Yomi Elo rating. Do you use variable K value that changes as a player accumulates more data?
I think I use a two-phase K: 97 for first 55 sets played, and 40 after. Those particular numbers were chosen by using a grid-search on the available data, IIRC.
That is interesting. That is the highest K value I’ve seen from the examples. Is it because the range is too compressed with a lower K value due to the smaller data size compared to Chess or MTG? Codex has a much smaller data size, I think. The player with the most games played is at around 150, and many of us are in the 30-50 game range. Would you recommend the same 97-40 K value for Codex? What is a grid search? I was wondering if I can try that on Codex data.
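On the grid-search question: the general idea is to pick a handful of candidate values for each parameter, try every combination, and keep the combination that predicts the historical results best. Below is a minimal sketch of that for a two-phase K, assuming standard Elo, a starting rating of 1500, and squared prediction error as the score. All of those are assumptions for illustration and may differ from how the Yomi sheet was actually tuned. `matches` is a hypothetical list of `(player_a, player_b, a_won)` tuples in chronological order.

```python
from collections import defaultdict
from itertools import product


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo win expectation for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def two_phase_k(sets_played: int, k_warmup: float, warmup_sets: int, k_settled: float) -> float:
    """Use the warmup K for a player's first `warmup_sets` sets, then the settled K."""
    return k_warmup if sets_played < warmup_sets else k_settled


def prediction_error(matches, k_warmup, warmup_sets, k_settled, start=1500.0) -> float:
    """Mean squared error of pre-match predictions for one parameter combination."""
    ratings = defaultdict(lambda: start)
    played = defaultdict(int)
    total = 0.0
    for a, b, a_won in matches:
        p = expected_score(ratings[a], ratings[b])
        outcome = 1.0 if a_won else 0.0
        total += (p - outcome) ** 2
        # Each player moves by their own K, based on how many sets they have played.
        k_a = two_phase_k(played[a], k_warmup, warmup_sets, k_settled)
        k_b = two_phase_k(played[b], k_warmup, warmup_sets, k_settled)
        ratings[a] += k_a * (outcome - p)
        ratings[b] -= k_b * (outcome - p)
        played[a] += 1
        played[b] += 1
    return total / len(matches)


def grid_search(matches, k_warmups, warmup_lengths, k_settleds):
    """Try every combination and return the (k_warmup, warmup_sets, k_settled) with lowest error."""
    candidates = product(k_warmups, warmup_lengths, k_settleds)
    return min(candidates, key=lambda c: prediction_error(matches, *c))
```

Usage would look something like `grid_search(matches, range(20, 121, 10), range(0, 81, 10), range(10, 61, 5))` on a non-empty list of matches. With a small data set the winning values can be noisy, so numbers tuned on Yomi would not necessarily carry over to Codex.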