What's a Tier List? A mathematical approach

TL;DR: you can skip to the tier list below, if you want to
A Tier List is a ranking of the characters based on how “good” they are. The problem comes from defining “good”.
Let’s take Rock-Paper-Scissors. Rock beats Scissors, Paper beats Rock and Scissors beats Paper. Is any option better than the others? Of course not: they are all equally strong because, on average, if both players picked randomly, all options would win an equal number of times.
On the other hand, let’s say that Rock has a 55% (5.5 in Match-Up terms) probability of winning against Scissors, Scissors has a 52% probability of winning against Paper and Paper has a 53% probability of winning against Rock.
Now it is pretty clear that Rock is stronger if both players are picking randomly, since it leads to the highest number of wins.

If players can instead choose which option to use, the mindgames start: yeah, Rock is stronger, but because it is stronger, its counterpick, Paper, is also strong. Scissors is strong because it is the counter to Rock’s counterpick, and Rock gets a bit stronger again because it counters that counter.
We can write this relationship in the form of a matrix, called the payoff matrix (rows are your pick, columns are the opponent’s pick, and each entry is your winning percentage):
   R  P  S
R 50 47 55
P 53 50 48
S 45 52 50

Let’s imagine a round-robin tournament in which players choose one option and stick with it until the end.
Which option will the players choose the most? How many players will choose each option?
To answer these questions, we’ll need a bit of a digression.

Let’s say you go to Rock city, where everyone plays Rock because they think it’s stronger. If you go there and use your payoff matrix, you can actually see that the best option for you is playing Paper.
Amazed by how well you did in the tournament, the players of Rock city wonder how you chose your option. You explain your method to them, and at the next tournament they all use it.
At that tournament there will be a lot of Paper, because it countered the most played option, but also some Scissors, because it countered the second most played option.
If you repeat this process enough times, you’ll eventually end up in a state in which the percentage for each option stays constant (although the percentages differ from one another). This is called the Nash Equilibrium.

The method we are talking about works as follows (warning: it requires some basic linear algebra; you can skip this part if you want):
Let’s call r, p and s the percentages of Rock, Paper and Scissors players in the last tournament, and collect them in a vector v = (r, p, s). If we multiply the payoff matrix M by v, we get a new vector, x1, whose first value is the expected strength of the Rock option, the second value is for Paper and the third is for Scissors.
If we compute x2 = Mx1, x3 = Mx2 and so on (re-normalising at each step so the values stay on the same scale), we eventually get to a point where x is barely changing any more. THAT is the Nash Equilibrium.
By the way, if you run the math on the RPS example, we get roughly 50.6%, 50.4% and 49.0% for Rock, Paper and Scissors.
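If you prefer code to words, here is a minimal sketch of that iteration in Python (assuming numpy is available), using the RPS matrix above:

import numpy as np

# Payoff matrix from above: rows and columns are Rock, Paper, Scissors,
# entry [i, j] is the win % of option i against option j.
M = np.array([
    [50.0, 47.0, 55.0],
    [53.0, 50.0, 48.0],
    [45.0, 52.0, 50.0],
])

# Start from an equally split population and repeatedly apply the matrix.
v = np.full(3, 1 / 3)
for step in range(10):
    x = M @ v          # expected win % of each option against the current population
    v = x / x.sum()    # re-normalise so the weights sum to 1 again
    print(step + 1, np.round(x, 2))

After a few steps the printed values stop moving, which is the converged state described above.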

Now, to come back to Yomi: I took the data from the Yomi Composite Chart and ran the math with that MU chart (which is what we have been calling the payoff matrix).
The resulting Tier List is:
S: Degrey, Zane
A: Grave, Lum, Troq, Setsuki, Valerie, Geiger, Onimaru
B: Argagarg, Menelker, Persephone, Gloria
C: Jaina, Rook, Quince, Gwen, Vendetta, Midori, Bal-Bas-Beta

This study makes a few assumptions:
First of all, it doesn’t take into account counterpicking. This isn’t really an issue if the selection is double-blind at every step of the game or if players stick with the same character all the time.
Secondly, it assumes that at the first step of computations every character is equally played. More skewed initial conditions will require more steps to converge.
Thirdly, it assumes that characters are only picked based on their power. We know that is not true, and we have many more Rook players than our Tier List would suggest.

The first assumption isn’t too big of a problem: our objective is optimizing the first game, because even if you get counterpicked you can counter-counterpick in the next game to level the field, and so on.
The second assumption, again, is not a problem: in my calculations, 3 steps were enough to converge, and with real-world data you should converge to the same results in more or less the same number of steps.
The third assumption, instead, is the main problem. Players are not very logical, and they may resist changing their main even when the meta is unfavorable to it. If you take this into account, and your objective is maximizing the wins you’ll get at the next tournament, you can take recent data and figure out the best characters to play.
For example, using the Historical Chart, which features data since 2014 (you might want to use only data from 2018 onwards to better reflect the current meta, but that might leave too few data points), the new “Tier List” (which isn’t really a Tier List, but rather the characters you could pick, ordered from best choice to worst) is:

S: Troq
A: Grave, Degrey, Lum, Zane, Setsuki, Valerie, Geiger, Onimaru
B: Argagarg, Quince, Menelker, Persephone
C: Jaina, Rook, Gloria, Gwen, Vendetta, Midori, Bal-Bas-Beta

The beauty of this method is that you can change the MU chart freely and even tweak it for specific tournaments. For example, let’s say you want to guess how the meta will shape up for an 18xx tournament. Some characters that are considered “good” were perhaps only good because the top played characters were good MUs for them; with those out of the question, the standings change.
To compute the results for 18xx you can simply set a starting population in which the banned characters have 0 coverage. That way, the first step of the computation will not take the banned characters into account when finding the best options, and you can simply pick the top-rated non-banned character from the tier list the method outputs.
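As a small sketch of how that could look in code (numpy assumed; the character names, matchup numbers and ban list below are placeholders, not the real chart):

import numpy as np

# Placeholder cast and MU chart, just to keep the sketch self-contained;
# in practice 'chars' is the full cast and 'M' comes from the composite chart.
chars = ["Grave", "Zane", "DeGrey", "Troq"]
M = np.array([
    [50.0, 45.0, 45.0, 50.0],
    [55.0, 50.0, 50.0, 55.0],
    [55.0, 50.0, 50.0, 40.0],
    [50.0, 45.0, 60.0, 50.0],
])
banned = {"Zane", "DeGrey"}   # hypothetical ban list for the 18xx-style tournament

# Starting population: zero coverage for banned characters, equal weight elsewhere.
v = np.array([0.0 if c in banned else 1.0 for c in chars])
v /= v.sum()

# One step of the matrix method; only legal characters are candidates for the pick.
x = M @ v
legal = [i for i, c in enumerate(chars) if c not in banned]
best = chars[max(legal, key=lambda i: x[i])]
print(best, np.round(x, 2))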

And, finally, here’s the Sheet I used for those computations. It’s not the cleanest sheet, since it was meant for quick testing, but here’s a quick overview:
In the Tier List sheet you see the results of the computations. There’s not much to play around with here, except that you can change the column referenced in the a priori winning % column: X if you want the first step, Y if you want the second step and Z if you want the third step.
Computations is where the magic happens.
Firstly, the MU chart. If you don’t believe the chart I used is a good one, you can simply change those numbers. Heck, if you want, you can even put in your own values in case you have an especially hard time with certain characters (but if you do so, keep in mind you can only compute the first step, since from the second step onwards you’d be assuming that everyone has the same problems with the same characters that you do).
Secondly, the historical data. You can change that to whatever you want. If you set a value of 1 in the games column for every character, you are computing the results under the assumption of equal initial popularity.
Last but not least: the number of steps. If you want to compute more steps, you can simply add another column on the right and clone the (now) second-to-last column into the newly created one.

In conclusion: keep in mind this is just maths. Psychology plays a big role in how a metagame evolves, and that can only partially be accounted for in the computations.
Also, despite all the talk of S and C tiers, the difference between the worst character and the best character is actually extremely small, so play whoever you like the most.

2 Likes

Hi TonyHole, nice to see someone else using Nash equilibria to make tier lists. There was a similar discussion here last year.

Two things I’d like to comment on, having looked through your sheet.

  1. You’re directly using Scymrian’s composite matchup chart. I’d recommend reproducing the composite chart instead, because the values given in the original Sheet have been rounded to the nearest 0.25. In addition to the loss of precision, this means some matchups are displayed as non-symmetric, e.g. Grave vs. Jaina is 5.75, but Jaina vs. Grave is 4.5.

  2. Your iterative method, accounting for some scaling differences, takes a matchup matrix M, and converges to the vector v that solves v = Mv. This is not the Nash equilibrium; it’s the matrix’s principal eigenvector, i.e. the part that doesn’t decay with repeated multiplication. I’m not sure how to do it in Sheets, but for the Nash equilibrium you want to solve the equation here. The iterative method is doable in Sheets, but requires a few more steps. You should find the difference between best and worst characters is rather larger than you show here.
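For concreteness, here is a minimal sketch (assuming numpy and scipy) of one way to set that up as a linear program, applied to the RPS chart from the opening post with 50 subtracted so that an even matchup is worth zero; it’s only an illustration, not necessarily the exact formulation behind the linked equation:

import numpy as np
from scipy.optimize import linprog

def zero_sum_equilibrium(A):
    """Optimal mixed strategy for the row player of a zero-sum game with payoff matrix A."""
    n = A.shape[0]
    # Variables: the n strategy weights plus the guaranteed value v; maximise v.
    c = np.concatenate([np.zeros(n), [-1.0]])
    # Against every opposing pure strategy j we need sum_i A[i, j] * x_i >= v.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)   # weights sum to 1
    b_eq = [1.0]
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n], res.x[n]

# RPS example from the opening post, with 50 subtracted from every entry.
A = np.array([[0.0, -3.0, 5.0],
              [3.0, 0.0, -2.0],
              [-5.0, 2.0, 0.0]])
strategy, value = zero_sum_equilibrium(A)
print(np.round(strategy, 3), round(value, 3))   # roughly [0.2 0.5 0.3], value 0

Note how different that is from the near-uniform distribution the iterative method settles on for the same chart.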

1 Like

As for point 1: sure, I can do that without too much of a problem, but I doubt the results are going to change. I didn’t notice the asymmetry, though; that’s a fair point.
Point 2, instead: yeah, I used the term “Nash equilibrium” improperly, but it usually gets the concept across to people who have at least heard of it.
I still stand by my approach, though, because the eigenvector (especially if found iteratively, instead of by solving directly) offers a cause-and-effect picture of the phenomenon.

EDIT: while checking the asymmetries, I noticed that even without the MROUND, Scymrian’s chart was asymmetric. I checked all the other charts and found the following anomalies:
IAmCBob: Quince vs. Arg
Staryu: Oni vs. Setsuki
DRB: Valerie vs. Arg
Character specialist: Vendetta vs. Arg
For each of them, I fixed the discrepancy by adding (or removing) the same amount on both sides of the MU, so that the two values add up again.

This changed the tiers a little bit; I’ll be updating the main post.

EDIT#2: since I was at it, I also computed one of the Nash Equilibria. This NE tells us that the “viable” characters are, in descending order:

  • Grave (31%)
  • Troq (30%)
  • DeGrey (14%)
  • Persephone (13%)
  • Geiger (12%)

So, uh, this gives me some conflicting feelings. It’s important to note that MUs are quite balanced in Yomi, so this doesn’t mean Grave and Troq are that much stronger than the others, but the result is pretty cool nonetheless.
Those percentages mean, more or less, that the optimal strategy is picking Grave 31% of the time, Troq 30% of the time, and so on.
The reason Troq and Grave are so high up is that they are the only characters among those five to have two fair MUs: the mirror match, and Troq vs. Grave.
This means that if you pick, for example, Grave, the only MU you are afraid of is the DeGrey MU. However, picking DeGrey is a bold play, since it is hard countered (again, “hard” is relative) by Troq, the other most played character.
Geiger, on the other hand, is a “hard” counter to Troq and is soft countered by Grave and DeGrey, which would make it a very solid choice if it weren’t for Persephone, the real dark horse here, who hard counters it.
Persephone also has more or less fair MUs against the big two, getting soft countered by both of them, but she also counters DeGrey.

I have very little experience with most of those MUs, but I didn’t expect my girl Persephone to make it up there thanks to Nash.

1 Like

I have no comments on the math itself, just wanted to note that the title made me think of this:

What is a tier list? A miserable little pile of characters! But enough talk—have at you!

2 Likes

What do you do? You just rank him. You take your keyboard and go to the tier list, and you rank him. Just ranking him is an amazingly simple concept that eludes us etc. etc.

2 Likes

That’s fair enough. I haven’t quite got my head around what the principal eigenvector actually tells you here, but the tiers you’ve used it to create look mostly reasonable. Forming tiers from the Nash equilibrium isn’t an obvious process anyway, with most of the cast getting zero weight.

You might be interested in the work vengefulpickle’s been doing to try to separate character strength and player skill effects in the historical data here. I don’t know what his matchup chart looks like at the moment, but it could be interesting to compare.

I was talking to VP just yesterday in the Discord, since he’s trying another approach to mathematically creating a tier list.
As of now, I’m not going to distinguish between player skill and character strength because, in an ideal world, those MU charts should represent the odds the best player of one character has against the best player of another character.
The world is far from ideal, so that changes things, but still…

As a sort of update, I also tried a genetic approach. Nothing crazy, despite the name: I just initialized a random population and made its members play against each other (according to the MU chart). The losing player has a chance to change character, while the winning player keeps their main.
I was expecting results in the same ballpark as the eigenvector, but I was still surprised at how close they turned out to be for such a rough, empirical approach.
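A rough sketch of this kind of simulation (numpy assumed; the exact switching rule here, losers moving to a random character 10% of the time, is just one possible choice, and the small RPS chart stands in for the full MU chart):

import numpy as np

rng = np.random.default_rng(0)

# Win probabilities, rows/columns in the same character order.
M = np.array([[50.0, 47.0, 55.0],
              [53.0, 50.0, 48.0],
              [45.0, 52.0, 50.0]]) / 100.0
n_chars = M.shape[0]

population = rng.integers(0, n_chars, size=1000)   # random initial mains

for generation in range(2000):
    # Pair players up at random and play one game per pair.
    rng.shuffle(population)
    a, b = population[0::2], population[1::2]
    a_wins = rng.random(a.size) < M[a, b]
    # Losers have a 10% chance of switching to a random new main.
    switch = rng.random(a.size) < 0.1
    new_mains = rng.integers(0, n_chars, size=a.size)
    a = np.where(~a_wins & switch, new_mains, a)
    b = np.where(a_wins & switch, new_mains, b)
    population = np.concatenate([a, b])

print(np.bincount(population, minlength=n_chars) / population.size)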

1 Like

NE percentages do NOT tell you the best characters (in general). Let’s take the simplest MU chart (X represents a mirror):

X 10 4
0 X 6
6 4 X

This results in the following payoff matrix (subtract 5 to get zero sum):

0 5 -1
-5 0 1
1 -1 0

The Nash Equilibrium is: 0.142857 0.142857 0.714286 (i.e. 1/7, 1/7, 5/7)

Very obviously the BEST character is the one with a 10-0 good MU and a 4-6 bad MU. But they do not see the most play. In fact, they are tied in use percentage with the character they 10-0.
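For anyone who wants to verify that, a quick check (numpy assumed): against the strategy (1/7, 1/7, 5/7) every pure option has an expected payoff of exactly zero, so no deviation gains anything.

import numpy as np

A = np.array([[0, 5, -1],
              [-5, 0, 1],
              [1, -1, 0]], dtype=float)
x = np.array([1, 1, 5]) / 7         # the claimed equilibrium weights

print(A @ x)                        # payoff of each pure option against x: all zero
print(bool((A @ x <= 1e-12).all())) # no pure strategy beats the equilibrium value of 0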

1 Like

Yes, because their great matchup is against a bad character that isn’t expected to see play often, and their much more frequent matchup is disfavoured. Good matchups don’t help much if you’ll rarely see them.

I’ve thought about this a bit. Humour me for a moment, and suppose that the Nash equilibrium is the “correct” way to calculate the optimal meta.

Character 1 has a low weight in that meta, but it’s also true that the meta is drastically changed by that character being removed, so the character has a large role in shaping / warping that meta. How much would you prefer it if that was accounted for when trying to enumerate character strength?

A difficulty I’m having with this is actually answering the question of: what do we mean by a tier list? What is character strength? Do we mean "this is a character that it’s important to be able to play in order to win tournaments"? Or "this is a character that can win against most other characters"? Or "this is a character who warps the meta by existing"? Each of those is going to lead to a different analysis. I think in most FGs, where most players main a single character, the question being answered is "What one character can I play that will win most often?", which, I think, leads to an analysis like the one I posted in Discord (basically, just counting the fraction of winning MUs, and then quantizing those into tiers) being reasonable.

For Yomi, though, I think it’s more interesting/useful to answer something closer to "Which characters will be relevant in the tournament meta?". For that, I think an analysis that finds the Nash Equilibrium using the Bo5 or Bo7 standard-CP payoff matrix, and then, from those Nash percentages, looks at which characters are used during counterpicks, would be an interesting one (I feel like I did that analysis at some point, but I don’t recall the results).

1 Like

Let’s set aside Tier Lists for a bit, and let’s look at a little something.

What can I do to maximize my chances of winning a tournament?
In a BO5 where you can counterpick at every round, a very important factor is winning the first game of the BO5, since that puts you in an advantaged state (if the opponent wins the next game by counterpicking, and you win the following one by counter-counterpicking, and so on, you win in the end).
In order to maximize your chances of winning the first game, you want to have a favorable MU against the opponent’s character. Not knowing the opponent’s choice, because of the double-blind pick, turns this into a guessing game.
More specifically, having an estimate of how many players are playing characters A, B and C gives you the opportunity to pick a character with a high a priori winning probability (which might not mean the character with the best MU against the most played character, but rather one with a good MU overall against the most used characters).
If you have an estimate of which characters are going to be played (i.e. historical data), just applying the matrix method once gives you the character with the highest a priori chance to win (step 1 in my sheet).

However, the first pick isn’t only influenced by the a priori chance of winning the first game, but also by the chance of winning against the possible counterpicks and the chances of your counter-counterpicks winning against those counterpicks.
This can be calculated too, with a similar method, although it isn’t covered in my sheets. To be precise, it should be possible to calculate a BO3, BO5 or BO7 matrix that tells you which first character gives you the highest chance of winning the overall match.
If there’s any interest, I might work on that and see if I get reasonable results.
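To give an idea of what such a matrix could look like, here is a rough sketch (numpy assumed) under the simplifying assumption that the loser of each game always counterpicks the character with the best MU against the winner’s character, while the winner has to stay, as in standard CP rules; the small RPS chart stands in for the real one:

import numpy as np
from functools import lru_cache

# Small placeholder win-probability matrix (rows = you, columns = opponent).
M = np.array([[50.0, 47.0, 55.0],
              [53.0, 50.0, 48.0],
              [45.0, 52.0, 50.0]]) / 100.0
WINS_NEEDED = 3   # best of 5

@lru_cache(maxsize=None)
def match_win_prob(me, opp, my_wins, opp_wins):
    """P(I win the set from this state), with greedy counterpicking by the loser."""
    if my_wins == WINS_NEEDED:
        return 1.0
    if opp_wins == WINS_NEEDED:
        return 0.0
    p = M[me, opp]
    # If I win, the opponent switches to their best counter against my character.
    opp_cp = int(np.argmin(M[me, :]))
    # If I lose, I switch to my best counter against their character.
    my_cp = int(np.argmax(M[:, opp]))
    return (p * match_win_prob(me, opp_cp, my_wins + 1, opp_wins)
            + (1 - p) * match_win_prob(my_cp, opp, my_wins, opp_wins + 1))

# Entry [i, j]: probability of winning the whole set when game 1 is character i vs. j.
bo5 = np.array([[match_win_prob(i, j, 0, 0) for j in range(3)] for i in range(3)])
print(np.round(bo5, 3))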

The problem with the Nash Equilibrium (which is also present in eigenvector methods, although it is much stronger in NE) is that it assumes completely logical players across the whole community. As stated previously, while that assumption might somehow work for just two players, it definitely doesn’t hold for larger groups, because there are going to be more Rook players or fewer BBB players than the math asks for, simply because people like Rook’s design and dislike BBB’s.
That’s why, for large rosters, NE is usually not the most solid of choices if you want a practical definition of the metagame. One step of the matrix method gives a better answer to the question “what’s the optimal choice to win game 1?”, while two steps give a good answer to the question “what’s the optimal choice to win game 1, assuming other players are using a step-1 approach?”, and so on. The eigenvector approach is just this reasoning carried on forever. By design, no strategy has a 50% or higher a priori winning probability against the eigenvector strategy (assuming equally skilled players and true randomization in picking the character), although step N has a higher chance to win against step N-1 than the eigenvector has against any other step because, by design, step N is the optimal counter to step N-1.
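As an aside, the a priori winning probability of one picking strategy against another is a single bilinear expression, so statements like the one above can be checked directly; a tiny sketch (numpy assumed, the RPS chart again standing in for the real one):

import numpy as np

M = np.array([[50.0, 47.0, 55.0],
              [53.0, 50.0, 48.0],
              [45.0, 52.0, 50.0]])

def winrate(p, q, chart=M):
    """Expected game-1 win % when you draw your pick from p and the opponent from q."""
    return float(np.asarray(p) @ chart @ np.asarray(q))

uniform = np.full(3, 1 / 3)
step1 = M @ uniform / (M @ uniform).sum()   # one step of the matrix method
print(winrate(step1, uniform))              # slightly above 50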

TL;DR: eigenvector solutions are the ultimate “no u”, which can’t be “no u”’d but, at most, “no we”’d in game 1. A similar eigenvector solution could be computed for BO3/5/7 without additional historical data.

If we define “tier list” as “a list of the most played characters assuming everyone is using the eigenvector strategy”, then there you have your tier list. You might agree with me that this is also synonymous with how “relevant” a character is (although, as @charnel_mouse pointed out, even the least relevant character has an impact on the metagame), which we could misname as how “powerful” that character is.

Yup, I’ve done that calculation in the past. I’ve got the results squirreled away in an ipython notebook somewhere. (I might have posted them in the other Yomi Data thread, too? I forgot.)

It’s probably this one, isn’t it?

Yeah, that’s definitely one. It doesn’t actually list out the payoffs for each of the best-of-N choices, sadly, so we can’t do the eigenvector method directly.

God, I’ve done so much math about this game…

This isn’t true, though. The best strategy against step N-1 is the character with the highest weight in step N. This can have a win probability greater than 50% a priori, as your spreadsheet shows for Zane and DeGrey.

Apart from rationality, the main problem with any approach like this, even if we had perfect matchup knowledge, is that player skill effects will be much stronger than the effect of character matchups, and that a player’s skill level can depend on the characters involved. This can change the “optimal” meta completely. I don’t remember how venge’s Yomi stuff worked out, but for Codex my work has player effects much larger than deck effects, and that’s just among players playing on the forum.

This isn’t true, though. The best strategy against step N-1 is the character with the highest weight in step N. This can have a win probability greater than 50% a priori, as your spreadsheet shows for Zane and DeGrey.

I may have said that a bit unclearly, but I think we are saying more or less the same thing.
Let’s say the step N-1 strategy is 50%, 30% and 20% for three options, while step N is 40%, 45% and 15%. This means that by choosing option 2 you’ll win 45% of the time, which is higher than 40% and 15%, so it’s of course the best choice (which is what you said).
On the other hand, I read it the opposite way: step N-1 says that the best strategy is playing option 1 50% of the time, option 2 30% of the time and option 3 20% of the time. For each of these options, the best option in step N changes. To be precise, step N says that the best option will be option 1 40% of the time, option 2 45% of the time and option 3 15% of the time. Overall, if both players pick randomly from their strategies, step N will have more than a 50% likelihood of winning. If, for some reason, you know that the opponent will play the top-rated character of step N-1, then the strategy you are playing against is no longer step N-1 but a modified version of it, which can be vectorised as (100%, 0%, 0%).

On one hand, picking the single best option from your strategy looks like an improved version of that strategy. On the other hand, randomizing your choices will increase your likelihood of winning, but it comes with two downsides:

  1. You NEED to know how to play all the match-ups
  2. You often won’t get extremely favorable MUs, but rather only mildly favorable ones (in which case your skill with that specific character matters the most, which makes point 1 even more problematic)

Apart from rationality, the main problem with any approach like this, even if we had perfect matchup knowledge, is that player skill effects will be much stronger than the effect of character matchups, and that a player’s skill level can depend on the characters involved.

In my calculations, I’m assuming you don’t know your opponent before the match. Of course this is suboptimal: I’m not using data I could probably use, just for the sake of generality.
I’m really bad against Setsukis, for example, and really bad at playing Gwen. This needs to be accounted for when you want to compute your best choices for the next step.
A simple way to do that is to modify your own chart, thus creating a personalized MU chart with which you can do all the computations (however, as I said earlier, you can’t compute more than one step, since from the second onwards you’d be assuming your opponent has the same traits as you).
You COULD do this for your opponent too: if everyone’s MU chart were public (i.e. we knew everyone perfectly), you could use that as additional information.
Finally, there’s some coupling happening because of specific interactions: for example, I’m pretty gutsy in some of my plays (e.g. aggressively blocking even when it’s highly likely my opponent wants to throw me, making them think that dodging into throw would be better than raw throwing) but much more conservative in others (e.g. dodge-confirming into highly damaging attacks). A player who is more methodical in neutral and in the early game, and gutsier in the late game, is basically my nemesis, and I have worse MUs against that player with most characters.
However, like all coupling effects, these are extremely tough to factor in, because they are highly non-linear and usually strongly reliant on specific situations.

I purposely chose to ignore most of these factors to get a player-agnostic chart (which assumes all players are equally strong and logical), which each player can then customize based on their own traits.

@TonyHole: I’m not sure if you’ve already seen it, but Yomi Skill actually lets you pick out two different players, and it’ll give you a personalized estimate of MU results for all characters between those two players. I’ve used it to guide my CP strategy at times.

This is the part where we disagree, I think. The distribution for step N is proportional to each character’s probability of success against the entire random distribution at N-1, not their probability of being the best counter-pick. Therefore, the character with the highest weight in step N is exactly the character that does best against the entire random distribution at step N-1. The optimal strategy against step N-1 is a pure strategy where you only take that character; using the distribution at step N instead is strictly worse.

The distribution for step N is proportional to each character’s probability of success against the entire random distribution at N-1, not their probability of being the best counter-pick.

We agree on this. Usually, the highest rated characters are not the best counterpicks to some relevant characters, but rather good counterpicks against most relevant characters.

Therefore, the character with highest weight in step N is exactly the character that does best against the entire random distribution at step N-1. The optimal strategy against step N-1 is a pure strategy where you only take that character, using the distribution at step N instead is strictly worse.

The character with the highest weight in step N is indeed the character with the highest expected payoff, but if its weight is under 50%, it means that when it fails, it fails hard.

Just in case I botched the math, I set up a quick Python script to simulate both scenarios:

  1. In scenario 1, one player plays according to the historical distribution (the one I have set up in my sheet), and the other plays according to step 1 applied to the historical distribution.
  2. In scenario 2, one player plays according to the historical distribution, and the other plays only the highest rated character of step 1 applied to the historical distribution.

Scenario 1, over the course of 10 simulations (with 100,000 games played per simulation), returned the following winrates:

  • 50.304%
  • 50.357%
  • 50.495%
  • 50.254%
  • 50.071%
  • 50.285%
  • 50.515%
  • 50.216%
  • 50.148%
  • 50.235%

Which, as expected, is always over 50% (actually, in testing I did once get a result under 50%, something like 49.7%, but still…).

On the other hand, although it shouldn’t need simulations to tell, scenario 2 gave the following results:

  • 47.42%
  • 47.592%
  • 47.393%
  • 47.392%
  • 47.361%
  • 47.327%
  • 47.306%
  • 47.257%
  • 47.402%
  • 47.266%

Which proves two things:

  1. I DID botch the math, since I expected a winrate of 52% or something (Troq’s winrate). I’ll have to double check this, although it feels strange to me
  2. A mixed strategy can be better than a pure strategy, and it always is for non-pathological cases in which you apply the matrix method.

In case anyone wants to check the script I used, you can find it at this link. (Mind you, I’m not a programmer by profession, so you’ll find some ugly code in there.)
data.csv is Scymrian’s corrected MU chart (the one in my sheets), and histdata.txt is basically a copy-paste of the games column from the historical data I also have in my sheets.
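For readers who don’t want to open the script, here is a stripped-down sketch of the same idea (numpy assumed); the MU chart and historical counts below are placeholders, whereas the real script reads them from data.csv and histdata.txt:

import numpy as np

rng = np.random.default_rng()

# Placeholder inputs; in the real script these come from data.csv and histdata.txt.
M = np.array([[50.0, 47.0, 55.0],
              [53.0, 50.0, 48.0],
              [45.0, 52.0, 50.0]]) / 100.0
hist = np.array([120.0, 80.0, 100.0])
hist /= hist.sum()

step1 = M @ hist
mixed = step1 / step1.sum()    # scenario 1: play the step-1 distribution
pure = int(step1.argmax())     # scenario 2: always play the top step-1 pick

def simulate(my_pick, n_games=100_000):
    """my_pick is either a distribution over characters or a single character index."""
    if np.ndim(my_pick) == 0:
        a = np.full(n_games, my_pick)
    else:
        a = rng.choice(len(hist), size=n_games, p=my_pick)
    b = rng.choice(len(hist), size=n_games, p=hist)   # opponent plays the historical mix
    wins = rng.random(n_games) < M[a, b]
    return wins.mean() * 100

print("scenario 1:", simulate(mixed))
print("scenario 2:", simulate(pure))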