Codex data thread

OK, here’s the first pass at showing how the meta’s changed. Lum’s Lucky Lottery 2 was left off, since the decks were randomly picked from a small set chosen by the host. There’s obviously a lot of variation between individual tournaments, but you can see some general trends. Most won’t be a surprise to people who’ve been here for a while.
Looking first at the starters:


Specs:

  • Disease was really common in XCAFRR19, but otherwise has been mostly ignored.
  • Demonology and Strength became less popular during XCAFS18, since they were part of the banned combinations. Necromancy just became more popular.
  • Truth has slowly disappeared recently.
  • Finesse has slowly gained in popularity over time as people found how well it works in a lot of decks.
  • Ninjitsu has a high proportion in LDT1, mostly because it was a small tournament with 8 players, and two people happened to pick it.

Finally, we look at decks, limited to decks that have been picked at least four times. This is just enough to include all the monocolour decks, and [Demonology]/Anarchy/Balance.

  • PPA’s fallen from grace, compared to the recent rise of [Discipline/Strength]/Finesse.
  • [Necromancy]/Blood/Truth, i.e. the endless Crashbarrows deck, was far less commonly picked than I expected.
  • Nightmare has been consistently popular, except for the brief period after XCAFS18, when the bans prompted people to mix up the meta a bit.

And before anyone asks: I did try some versions using the starters' and specs' card colours. It turns out that if you put all the specs' colours together at once, the resulting colour palette is hideous.

3 Likes

Crashbarrow.dec is a really fun deck, but it's kind of a one-trick pony. It has trouble dealing with Miracle Grow and Nightmare, and to an extent PPA, so it fell out of favor in this meta. PPA also fell out of favor because Miracle Grow proved to be slightly faster.

2 Likes

I’ll add Crashbarrow.dec to the nicknames list, thanks for reminding me.

Based on the plot @mysticjuicer linked to here, below is a first attempt at 2D tier plots.

Each one has a standard version, giving mean effect vs. effect standard deviation, where the effect is the win probability for the monocolour decks and the log-odds effect for the starters and specs. There's also an experimental version that uses the mean Nash weight instead of the main effect. Silmerion describes the axes as high tier / low tier vs. well-rounded / has issues.

Starters


Specs


Monocolour decks


Usual caveats:

  • these reflect average play rather than optimal play because the model can’t distinguish them yet
  • these use mean effects over all simulation samples, so don’t reflect the model’s uncertainty
  • the model currently doesn’t account for within-deck synergies at all
  • the Nash results are lagging behind everything else because calculating them currently takes ages
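
For anyone who wants to make similar plots from their own numbers, here's a rough sketch of how the standard version could be put together for the monocolour decks. This is my reading of the axes (mean and spread of a deck's win probabilities across opponents), not necessarily the exact calculation behind the plots above, and the matchup values are dummy placeholders:

```python
import numpy as np
import matplotlib.pyplot as plt

decks = ["MonoBlack", "MonoBlue", "MonoGreen", "MonoPurple", "MonoRed", "MonoWhite"]

# Dummy matchup matrix for illustration only: matchups[i, j] = P(deck i beats deck j),
# already averaged over turn order and over posterior samples.
rng = np.random.default_rng(0)
matchups = rng.uniform(0.3, 0.7, size=(len(decks), len(decks)))

# One reading of the two axes:
#   x: mean win probability against the other decks ("high tier / low tier")
#   y: spread of those win probabilities ("well-rounded / has issues")
off_diagonal = ~np.eye(len(decks), dtype=bool)
mean_effect = np.array([row[keep].mean() for row, keep in zip(matchups, off_diagonal)])
effect_sd = np.array([row[keep].std() for row, keep in zip(matchups, off_diagonal)])

fig, ax = plt.subplots()
ax.scatter(mean_effect, effect_sd)
for name, x, y in zip(decks, mean_effect, effect_sd):
    ax.annotate(name, (x, y))
ax.set_xlabel("mean win probability")
ax.set_ylabel("win probability standard deviation")
plt.show()
```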
1 Like

I've sped up the Nash calculations so they finish within 24 hours, which means they can now be updated along with everything else.
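
For anyone curious how Nash weights like these can be found, here's a minimal sketch of the textbook approach: treat the matchup table as a zero-sum game and solve the row player's linear program. This is a generic illustration using scipy, not necessarily the exact code behind the numbers below:

```python
import numpy as np
from scipy.optimize import linprog

def nash_weights(payoff):
    """Equilibrium deck weights for the row player of a zero-sum game.

    payoff[i, j] = P(row player wins | row picks deck i, column picks deck j).
    Returns the row player's mixed strategy and the value of the game
    (the row player's win probability under equilibrium play).
    """
    n_rows, n_cols = payoff.shape
    # Variables: [x_1, ..., x_n, v]; maximise v, i.e. minimise -v.
    c = np.zeros(n_rows + 1)
    c[-1] = -1.0
    # For every column j:  v - sum_i payoff[i, j] * x_i <= 0.
    A_ub = np.hstack([-payoff.T, np.ones((n_cols, 1))])
    b_ub = np.zeros(n_cols)
    # The weights must sum to one.
    A_eq = np.hstack([np.ones((1, n_rows)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_rows + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_rows], res.x[-1]
```

The P2 weights come from the same LP with the payoff matrix transposed and complemented (1 - payoff), and the game value corresponds to the kind of "P1 advantage under Nash equilibrium" numbers at the end of this post.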

Total multicolour Nash weights for starters and specs (effective tier list):

Starter P1 P2 Both
Black 0.256 0.266 0.320
Purple 0.100 0.220 0.164
White 0.202 0.102 0.146
Red 0.180 0.102 0.121
Neutral 0.110 0.113 0.105
Blue 0.081 0.110 0.084
Green 0.072 0.086 0.060
Spec P1 P2 Both
Necromancy 0.369 0.233 0.376
Blood 0.252 0.208 0.280
Strength 0.185 0.200 0.211
Demonology 0.188 0.202 0.200
Finesse 0.226 0.154 0.199
Past 0.114 0.251 0.186
Peace 0.169 0.168 0.180
Bashing 0.153 0.169 0.180
Anarchy 0.250 0.135 0.173
Truth 0.132 0.172 0.148
Balance 0.094 0.144 0.111
Growth 0.124 0.106 0.107
Future 0.080 0.139 0.102
Discipline 0.184 0.056 0.102
Present 0.073 0.163 0.101
Disease 0.082 0.139 0.099
Feral 0.079 0.097 0.069
Fire 0.064 0.119 0.064
Ninjitsu 0.097 0.067 0.059
Law 0.085 0.078 0.055

Monocolour Nash:

Deck P1 P2 Both
MonoBlack 0.443 0.230 0.522
MonoRed 0.289 0.185 0.230
MonoPurple 0.043 0.291 0.111
MonoBlue 0.075 0.168 0.070
MonoWhite 0.101 0.035 0.036
MonoGreen 0.049 0.091 0.031

Player 1 advantage under Nash equilibrium:

Format P1 win probability
Monocolour 0.549
Multicolour 0.530
1 Like

Two things as we approach the start of CAMS:

  1. As with the previous standard tournament, I’ll be tracking the tournament results on my site, to see how well the model predicts the match outcomes. In theory, I can put up match predictions as soon as the matches are announced each round. However, I’m worried that this would affect the matches if the players see the predictions beforehand, so unless people specifically request predictions I’ll only put them up after the relevant match has finished.
  2. About a year and a half ago now – wow, have I been working on this for that long? – @Metalize sent me some additional match data, which was a mixture of casual matches and a tournament at Igrokon. I’ve had an alternative model running for a while, which incorporates his data, but haven’t talked about it much. This was because it was giving weird results, where adding his data seemed to make the predictions much worse. Recently I discovered that I wasn’t checking its predictions properly: this has now been fixed, and the predictions look pretty good. Once CAMS is over, I’ll post properly about how adding his data affects things. Apologies to Metalize for taking so long to do this.
1 Like

Hah, year and a half, that’s some time =D

This dataset also has some extra points of contact with the forum Codex meta — @bolyarich is IgorB from the dataset. Consider including this cross-reference!

1 Like

Cool, didn’t know you were on the same circuit! I’ll bundle all his games under bolyarich, and mark IgorB as an alias. I can do player skill plots for your players like I’ve done for the forum players, if you’d like?

1 Like

You even ask? Of course we’d like :smiley:

Great! I’ll link bolyarich’s games up first, then post it up here. Or pm, if you’d prefer. I can add your players to the plots on my site too if they’d like, but only if you say yes, since we’re dealing with real names here.

1 Like

They all agreed to this back when we were collecting the data so feel free to publish the stats :smiley:

1 Like

OK, first version of player skill plots for Metalize’s players, based on his data and the data from the forum matches. I’ve split this into two versions:

  1. “Metalize tournament data” excludes the recorded casual matches, so it only looks at tournament matches as I do on the forum data. This mostly consists of matches at Igrokon 2018, or the intra-city qualifiers for it.
  2. “Metalize data” also includes the casual matches. Players are ordered by their rank in the latter version, since not all the players are present in the tournament data. @bolyarich is on the plots as bolyarich, @Metalize is on them as LeonidG.

Since all these matches are from 2017 and 2018, and I’m not yet modelling change in skill over time, the players’ skill levels could be very different by now.

Player skills

Player probabilities of being best player

Player skill rank distributions


bolyarich got bumped up a lot by including his forum matches, since he's got 31 matches recorded on the forum and only 7 in Igrokon. Otherwise, most of the player skill distributions are pretty wide, due to the small number of games involved, apart from a few players once the casual matches are included.

We can also have a look at how adding Metalize’s data affects the model’s views on the monocolour matchups:

Monocolour matchups

Player 1 on the rows, Player 2 on the columns. Matchup is probability of P1 winning.

Including the casual matches causes some big matchup shifts; I'd be interested in people's opinions on whether the changes make the model more or less accurate.

Tracking page for CAMS20 is now up.

This is amazing stuff! (In truth, after most of the statistics details in the first few posts went over my head, I didn’t even attempt to understand what was going on inside your models. However, just the amount of time this must’ve taken is impressive)
I was wondering why you haven’t gathered more data from non-tournament matches. Is it because of the time involved (in that collecting match data from a tournament must be much more efficient than scanning loads of different threads) or is it because you feel the data from casual matches won’t be as representative of true match-up advantages as players will try weirder strategies or not focus as much?

1 Like

Good question! It’s a mixture of both. The main goal for the model is to work out how good the starters and specs are in high-level play, so limiting things to tournament results seems the best way to do that. It also meant I could go back and capture all the historical data without going insane.

If I went back to add casual matches, I'd want the process to be automated, and that's not easy to do for forum threads, as opposed to a digital version: the end of the match needn't coincide with the last posts in the thread, or there may be a finals thread where the last two players play several matches without starting new threads, or people occasionally play follow-up casual matches in the same thread, or people misspell their own name in the thread name, or the thread name gives the wrong decks, etc. If a digital version appeared, a lot of these problems would disappear.

What wouldn't disappear is that I occasionally need to make a judgement call on whether a match should be included in the model: matches won on timeout aren't currently included, but I also exclude matches by marking them as forfeits, for reasons that can't always be generalised. For example, this match is a clear candidate for removal, since skiTTer resigned on turn 1 when she saw she was facing Nightmare, so it carries no information about deck strengths. More subjectively, I've also excluded a lot of one player's games (petE), because they realised they'd been playing incorrectly since they joined.

I don't know how much it would affect things to add casual match data too, but the monocolour matchup plot for Metalize's data three posts up gives an idea of what sort of effect it might have.

Final CAMS prediction results are up. I’ll write a post about them when I have time, but generally the model did pretty well.

1 Like

Just before I posted this, I realised I had a few matches where I’d recorded players with the wrong decks, so I had to go back and fix my automated data checks. It doesn’t seem to have changed the results much.

For those who are interested, I've split off the data, data checks, and deck name converters into a separate module and put it on GitHub here.

Calibration

We're giving the win probability for Player 1 in each match, not just a probability for the favoured player. If the model thinks the match is 7-3, we expect to see Player 1 win 70% of those matches, not all of them.

This means that, on the plot giving P1 win probabilities, we expect the P2 wins to gradually become more common as the probability decreases, not a sharp cut-off at 50%.
matchup
This is roughly what we see. That P2 victory at c. 85% P1 win probability in round 3 is an upset, sure, but we expect to see a few upsets at that probability level.

(If you're wondering why round 2 only seems to have 3 matches, that purple dot is actually two purple dots right on top of each other.)

The only thing I’d be concerned with here is that we don’t see any P1 wins at low probabilities. However, there weren’t many matches that favoured P2 anyway, so this might just be due to random chance, given the low sample size.
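
If anyone wants to run the same sort of check on their own predictions, here's a small sketch of the underlying idea: bin the matches by predicted P1 win probability and compare the average prediction with the observed P1 win rate in each bin. The function and variable names are made up for illustration:

```python
import numpy as np

def calibration_table(p1_probs, p1_wins, n_bins=5):
    """Group matches by predicted P1 win probability and compare with outcomes."""
    p1_probs = np.asarray(p1_probs, dtype=float)
    p1_wins = np.asarray(p1_wins, dtype=float)   # 1 if P1 actually won, else 0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    which_bin = np.digitize(p1_probs, edges[1:-1])  # bin index 0 .. n_bins - 1
    rows = []
    for b in range(n_bins):
        in_bin = which_bin == b
        if in_bin.any():
            rows.append({
                "bin": f"{edges[b]:.1f}-{edges[b + 1]:.1f}",
                "matches": int(in_bin.sum()),
                "mean prediction": float(p1_probs[in_bin].mean()),
                "observed P1 win rate": float(p1_wins[in_bin].mean()),
            })
    return rows
```

For a well-calibrated model the last two columns should roughly agree, within the noise you'd expect from the number of matches in each bin.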

Match closeness in late tournament

One thing I've been interested in on this tournament tracker is whether matches in later rounds are more fair, i.e. closer to a 50% P1 win probability, as the field narrows to the most competitive player/deck pairings.
fairness
For CAMS20, I'm not really convinced this was the case. However, the number of matches in later rounds is always lower, so this could be due to random chance too. To get a better idea of how fairness changes as a tournament goes on, I'd have to look at several tournaments at once, unless we get a tournament with several dozen players.

Brier score

Reminder: the Brier score is a number between 0 and 1; the smaller the Brier score, the better. It's a proper scoring rule: if you knew the true result probability, you would optimise your expected score by giving that true probability as your prediction. If you always predict 5-5, you're guaranteed a score of 0.25, so that's a good benchmark for your predictions to improve on.

On the evaluation page, I give both the observed Brier score and the “prior expected score”: assuming the predictions are correct, what is the average score we'd expect to see? This treats the matches as fixed, so it ignores the fact that, if matches had ended differently, we'd have seen different matches later.

forecast              Brier score
always 5-5            0.250
prior expected score  0.194
model                 0.173
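
To make the table concrete, here's how the two non-trivial rows could be computed from the match predictions and results. These are generic helper functions, not the site's actual code:

```python
import numpy as np

def brier_score(p1_probs, p1_wins):
    """Mean squared difference between predicted P1 win probability and outcome."""
    p = np.asarray(p1_probs, dtype=float)
    y = np.asarray(p1_wins, dtype=float)   # 1 if P1 won, else 0
    return float(np.mean((p - y) ** 2))

def prior_expected_score(p1_probs):
    """Expected Brier score if the predictions are exactly right.

    If P1 wins with probability p, the expected squared error is
    p * (1 - p)^2 + (1 - p) * p^2 = p * (1 - p).
    """
    p = np.asarray(p1_probs, dtype=float)
    return float(np.mean(p * (1.0 - p)))

# A constant 5-5 forecast always scores (0.5 - outcome)^2 = 0.25, whatever happens.
```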

The model is clearly doing better than flipping a coin. But how much better? I personally find it quite hard to tell even when there are several models to compare, let alone when we're looking at the score for a single model. Next time, I'm going to break the score up into more intuitive parts, and see whether that helps.

However, if we again assume that the model’s predictions are correct, we can look at how likely we are to see the observed score. If it turns out to be highly extreme, then the model is way off in its predictions. Since we can calculate the exact distribution of the score if the predictions are correct, we can plot the distribution, with a vertical line indicating the observed value:

The observed score is in a highly likely region, so there's no reason from this to think that the model is doing something badly wrong.
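
For completeness, here's a sketch of one way to get that exact distribution: each match contributes one of two possible squared errors, so you can build up the distribution of the total score match by match. This brute-force convolution is fine for a single tournament's worth of matches, though it isn't necessarily how the plot above was produced:

```python
import numpy as np

def brier_score_distribution(p1_probs):
    """Exact distribution of the Brier score, assuming the predictions are correct.

    A match with P1 win probability p contributes (1 - p)^2 with probability p
    (P1 wins) or p^2 with probability 1 - p (P2 wins); the score is the mean
    contribution over all matches.
    """
    dist = {0.0: 1.0}  # running distribution of the summed contributions
    for p in p1_probs:
        outcomes = [((1.0 - p) ** 2, p), (p ** 2, 1.0 - p)]
        new_dist = {}
        for total, prob in dist.items():
            for contribution, chance in outcomes:
                key = round(total + contribution, 12)  # merge equal totals
                new_dist[key] = new_dist.get(key, 0.0) + prob * chance
        dist = new_dist
    n = len(p1_probs)
    scores = np.array(sorted(dist)) / n
    probs = np.array([dist[k] for k in sorted(dist)])
    return scores, probs
```

Plotting probs against scores, with a vertical line at the observed score, gives the kind of plot shown above.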

Current plans

Still, I’m not quite happy with the model as it stands. There’s no accounting for synergies between starters/specs in a deck whatsoever, which should have a huge effect on a deck’s strength. The absence of synergies is probably why there are some odd results, like Nightmare and Miracle Grow not being considered especially great decks.

I’m looking to have a new model version that includes within-deck synergies by CAWS20. But first, I’ll be adding handling for the “new” starters/specs that will be appearing in XCAFS20. I won’t be doing predictions while it’s on, since there’s no starter/spec match history for the model to work with.

4 Likes

The new model version will take a while, so in the meantime I’ve added some more plots/tables for the monocolour matchups. There are now matchup evaluations averaged over turn order (“general matchup”), and evaluations for how dependent matchups are on who goes first.
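
To spell out what the new columns mean (this is my own reading, but it's consistent with the values in the tables below): the general matchup averages a deck's win probability over the two turn orders, and the fairness number is just how far the matchup is from 50-50, rescaled so that a perfectly even matchup scores 1. A tiny sketch:

```python
def general_matchup(p_deck1_going_first, p_deck2_going_first):
    """Deck 1's win probability against deck 2, averaged over who goes first.

    p_deck1_going_first: P(deck 1 wins | deck 1 is Player 1)
    p_deck2_going_first: P(deck 2 wins | deck 2 is Player 1)
    """
    return 0.5 * (p_deck1_going_first + (1.0 - p_deck2_going_first))

def fairness(p):
    """1.0 for a perfect 50-50 matchup, falling to 0.0 for a completely one-sided one."""
    return 1.0 - 2.0 * abs(p - 0.5)

# e.g. fairness(0.540) -> 0.92 (up to rounding), matching the MonoBlue vs MonoPurple row below.
```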

As usual, there are more results on the site (this page); here are some stand-outs:

Fairest general matchups (apart from mirrors, which are always perfectly fair):

Deck 1 Deck 2 Deck 1 win probability matchup fairness
MonoGreen MonoRed 0.500 5.0-5.0 1.00
MonoBlue MonoPurple 0.540 5.4-4.6 0.92
MonoPurple MonoRed 0.459 4.6-5.4 0.92
MonoBlue MonoGreen 0.456 4.6-5.4 0.91
MonoBlue MonoWhite 0.451 4.5-5.5 0.90

Unfairest general matchups:

Deck 1 Deck 2 Deck 1 win probability matchup fairness
MonoBlack MonoGreen 0.722 7.2-2.8 0.56
MonoPurple MonoWhite 0.680 6.8-3.2 0.64
MonoBlack MonoWhite 0.627 6.3-3.7 0.75
MonoBlack MonoBlue 0.604 6.0-4.0 0.79
MonoRed MonoWhite 0.412 4.1-5.9 0.82

Surprise surprise, Black has many of the most lopsided matchups.

Fairest matchups WRT turn order:

Deck 1 Deck 2 P1 win probability matchup fairness
MonoPurple MonoWhite 0.507 5.1-4.9 0.99
MonoBlack MonoPurple 0.509 5.1-4.9 0.98
MonoGreen MonoGreen 0.487 4.9-5.1 0.97
MonoRed MonoRed 0.481 4.8-5.2 0.96
MonoBlack MonoGreen 0.478 4.8-5.2 0.96

Most P1-slanted matchups:

P1 P2 P1 win probability matchup fairness
MonoBlue MonoGreen 0.757 7.6-2.4 0.49
MonoBlack MonoWhite 0.696 7.0-3.0 0.61
MonoWhite MonoWhite 0.668 6.7-3.3 0.66
MonoBlack MonoRed 0.642 6.4-3.6 0.72
MonoRed MonoWhite 0.612 6.1-3.9 0.78

Most P2-slanted matchups:

P1 P2 P1 win probability matchup fairness
MonoGreen MonoPurple 0.309 3.1-6.9 0.62
MonoPurple MonoPurple 0.391 3.9-6.1 0.78
MonoBlue MonoPurple 0.411 4.1-5.9 0.82
MonoBlack MonoBlue 0.456 4.6-5.4 0.91
MonoBlue MonoBlue 0.466 4.7-5.3 0.93

The above results don't account for uncertainty at all; I'm just using the mean matchup. Comments welcome.

I've updated the Nash equilibria again; here are the 2D “tier lists” with the resulting Nash weights.

Strength, Growth, and Balance are down. Ninjitsu (?), Bashing (??), Blood, and Demonology are up.

1 Like

Why is Bashing so high on this plot? Is this reflecting the rebalance tourney?