
Codex data thread

Yeah, I can’t really evaluate whether they faced off without doing something much more complicated. The mere presence of a spec could shape the match without it ever being used, though, and the hero can affect things too. The meaning of the Present vs. Growth component is not how well those specs face off, it’s a partial evaluation of how well decks including them face off. You couldn’t just take the neutral components to evaluate the 1-spec starter game format, for example.


This was a ton of work! Thanks for attempting to do this. Also, I would be interested in testing 10-match mono-color intervals (with advice) to further strengthen match-up assertions.


Thanks! I’m going to wait and see how the results for the current tournament affect things; it should reduce the confounding between strong players and strong decks a bit. Then I’ll put the “lopsided” matches, minus Black vs. Blue, up in a thread, and see who’s interested. If you mean advice from other players, I was thinking of suggesting some warm-up matches, à la MMM1, but not enforcing them, so that could be a good time for that sort of thing.

I’m not sure about the exact format yet. @FrozenStorm was your suggestion above to take one matchup at a time, and have different pairs of players play one game each for a total series of ten? If we do it like that, I could put up the lopsided matches and do a poll on which one to do first/next, and it’s a bit less time-demanding than two players doing ten matches.


@charnel_mouse re-reading my previous post, I think the format I had in mind was something more akin to MMM’s format, where a sign-up sheet is posted, either to just get a pool of players to be assigned decks, or, more exactly like MMM, for players to sign up for specific matchups. (I think I prefer the latter, as this is meant in my eyes to be an opportunity to “challenge” what’s listed as a “bad matchup” by proving it more even.)

So something like:

  1. You compile a list of the 5-10 most “interesting” pieces of your data (perhaps we can have a Discord call w/ some experienced players to sort out which results from the data are most surprising to us, or put up a poll)
  2. Players are then invited to posit a challenge to those data (e.g. you have Purple vs Green as 2-8 and I think it’s much closer to even, so I volunteer to play Purple. Or perhaps your data has Blue vs Red at 6-4 and someone thinks it’s closer to 3-7 the opposite way. I’m making those numbers up, but you catch my drift)
  3. Players w/ at least some experience on either deck are invited to help play out a 10-game series (I’d bring @Shadow_Night_Black in to help out with Purple, for example, and I think @Mooseknuckles has played a fair share of Green starter at least? And @EricF and @zhavier are just really strong players who have played a fair bit of Growth and could probably puzzle out great plays?). It could be that we play in council with one another, or we just split up who plays how many games on which decks

A bit sloppy of an idea, I’ll admit, but at least a means of seeing w/ experienced players “is this data reflective of more targeted testing?”


That sounds reasonable. Would it help if I also drew up the model’s matchup predictions when the players are taken into account? Re: Discord calls, if you wanted me on the call too we’d have to work around my being on GMT.


While we’re waiting for XCAFRR19 to finish up, I thought I’d put up what the model retroactively considers to be highlights from CAMS19 – the most recent standard tournament – and MMM1. The model currently considers player skills to be about twice as important as the decks in a matchup, so the MMM1 highlights might be helpful to provide context to some of the recent monocolour deck results.

CAMS19 fairest matches

| match name | result probability | fairness |
| --- | --- | --- |
| CAMS19 R4 FrozenStorm [Future/Past]/Finesse vs. charnel_mouse [Balance]/Blood/Strength, won by charnel_mouse | 0.50 | 0.99 |
| CAMS19 R3 bansa [Law/Peace]/Finesse vs. zhavier Miracle Grow, won by bansa | 0.51 | 0.98 |
| CAMS19 R10 bansa [Law/Peace]/Finesse vs. zhavier Miracle Grow, won by zhavier | 0.49 | 0.98 |
| CAMS19 R3 codexnewb Nightmare vs. Legion Miracle Grow, won by codexnewb | 0.53 | 0.95 |
| CAMS19 R6 Persephone MonoGreen vs. bansa [Law/Peace]/Finesse, won by bansa | 0.47 | 0.94 |

CAMS19 upsets

| match name | result probability | fairness |
| --- | --- | --- |
| CAMS19 R2 Legion Miracle Grow vs. EricF [Anarchy/Blood]/Demonology, won by Legion | 0.39 | 0.79 |
| CAMS19 R9 EricF [Anarchy/Blood]/Demonology vs. zhavier Miracle Grow, won by zhavier | 0.46 | 0.92 |
| CAMS19 R6 Persephone MonoGreen vs. bansa [Law/Peace]/Finesse, won by bansa | 0.47 | 0.94 |
| CAMS19 R10 bansa [Law/Peace]/Finesse vs. zhavier Miracle Grow, won by zhavier | 0.49 | 0.98 |
| CAMS19 R4 FrozenStorm [Future/Past]/Finesse vs. charnel_mouse [Balance]/Blood/Strength, won by charnel_mouse | 0.50 | 0.99 |

MMM1 fairest matches

| match name | P1 win probability | fairness | observed P1 win rate |
| --- | --- | --- | --- |
| zhavier MonoGreen vs. EricF MonoWhite | 0.51 | 0.99 | 0.6 (3/5) |
| FrozenStorm MonoBlue vs. Dreamfire MonoRed | 0.49 | 0.99 | 0.6 (3/5) |
| Bob199 MonoBlack vs. FrozenStorm MonoWhite | 0.47 | 0.95 | 0.6 (3/5) |
| cstick MonoGreen vs. codexnewb MonoBlack | 0.53 | 0.94 | 0.6 (3/5) |
| codexnewb MonoBlack vs. cstick MonoGreen | 0.45 | 0.90 | 0.4 (2/5) |

I’d recommend looking at the P1 Black vs. P2 White matchup, and both Black/Green matchups, in the latest monocolour matchup plot, to see how much the players involved can change a matchup.

MMM1 unfairest matchups

| match name | P1 win probability | fairness | observed P1 win rate |
| --- | --- | --- | --- |
| HolyTispon MonoPurple vs. Dreamfire MonoBlue | 0.08 | 0.15 | 0.0 (0/5) |
| Nekoatl MonoBlue vs. FrozenStorm MonoBlack | 0.09 | 0.17 | 0.0 (0/5) |
| FrozenStorm MonoBlack vs. Nekoatl MonoBlue | 0.90 | 0.20 | 1.0 (5/5) |
| Shadow_Night_Black MonoPurple vs. Bob199 MonoWhite | 0.87 | 0.27 | 1.0 (5/5) |
| Bob199 MonoWhite vs. Shadow_Night_Black MonoPurple | 0.16 | 0.32 | 0.0 (0/5) |
| Dreamfire MonoBlue vs. HolyTispon MonoPurple | 0.80 | 0.40 | 0.8 (4/5) |
| EricF MonoWhite vs. zhavier MonoGreen | 0.78 | 0.43 | 1.0 (5/5) |

A quick update after adding the results from XCAFRR19.

The player chart is getting a little crowded with all the new players recently – hooray! – so I’ve trimmed it to only show players that were active in 2018–2019.

Player skill

Player skill is currently not dependent on turn order, time, or decks used.

Monocolour matchups

Row is for P1, column is for P2. Player skills are assumed to be equal.

Monocolour matchup details

Sorted by matchup fairness.

| P1 deck | P2 deck | P1 win probability | expected score (of 10) | matchup fairness |
| --- | --- | --- | --- | --- |
| Green | Purple | 0.501 | 5.0-5.0 | 1.00 |
| Red | Black | 0.503 | 5.0-5.0 | 0.99 |
| Green | Red | 0.493 | 4.9-5.1 | 0.99 |
| White | Blue | 0.484 | 4.8-5.2 | 0.97 |
| Purple | Purple | 0.522 | 5.2-4.8 | 0.96 |
| Blue | Red | 0.541 | 5.4-4.6 | 0.92 |
| Purple | Red | 0.459 | 4.6-5.4 | 0.92 |
| Blue | Blue | 0.453 | 4.5-5.5 | 0.91 |
| White | Black | 0.443 | 4.4-5.6 | 0.89 |
| Black | Black | 0.426 | 4.3-5.7 | 0.85 |
| Blue | Purple | 0.427 | 4.3-5.7 | 0.85 |
| Red | Green | 0.581 | 5.8-4.2 | 0.84 |
| Blue | White | 0.583 | 5.8-4.2 | 0.83 |
| White | White | 0.584 | 5.8-4.2 | 0.83 |
| Red | Red | 0.595 | 6.0-4.0 | 0.81 |
| Green | White | 0.593 | 5.9-4.1 | 0.81 |
| Black | Purple | 0.602 | 6.0-4.0 | 0.80 |
| Green | Green | 0.378 | 3.8-6.2 | 0.76 |
| Red | Blue | 0.631 | 6.3-3.7 | 0.74 |
| Red | White | 0.367 | 3.7-6.3 | 0.74 |
| Blue | Green | 0.635 | 6.4-3.6 | 0.73 |
| Black | Blue | 0.668 | 6.7-3.3 | 0.66 |
| Black | Red | 0.670 | 6.7-3.3 | 0.66 |
| White | Red | 0.677 | 6.8-3.2 | 0.65 |
| Purple | Blue | 0.308 | 3.1-6.9 | 0.62 |
| Blue | Black | 0.306 | 3.1-6.9 | 0.61 |
| Red | Purple | 0.712 | 7.1-2.9 | 0.58 |
| Black | White | 0.709 | 7.1-2.9 | 0.58 |
| Green | Blue | 0.721 | 7.2-2.8 | 0.56 |
| White | Green | 0.726 | 7.3-2.7 | 0.55 |
| Purple | Black | 0.253 | 2.5-7.5 | 0.51 |
| Black | Green | 0.745 | 7.5-2.5 | 0.51 |
| White | Purple | 0.241 | 2.4-7.6 | 0.48 |
| Green | Black | 0.237 | 2.4-7.6 | 0.47 |
| Purple | Green | 0.203 | 2.0-8.0 | 0.41 |
| Purple | White | 0.798 | 8.0-2.0 | 0.40 |
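For anyone wanting to reproduce the derived columns: as far as I can tell from the numbers, the fairness figure is just twice the weaker side’s win probability, and the score is the expected result over a 10-game series. A quick Python sketch of that reading (the function names are mine, and this is inferred from the tables rather than taken from the model code):

```python
def expected_score(p1_win_prob: float, games: int = 10) -> str:
    """Expected series score for P1 over a fixed number of games."""
    p1_wins = round(p1_win_prob * games, 1)
    return f"{p1_wins}-{round(games - p1_wins, 1)}"


def fairness(p1_win_prob: float) -> float:
    """Fairness as twice the weaker side's win probability (1.0 = coin flip)."""
    return round(2 * min(p1_win_prob, 1 - p1_win_prob), 2)


# Reproducing the first and last rows of the table above:
print(expected_score(0.501), fairness(0.501))  # 5.0-5.0 1.0
print(expected_score(0.798), fairness(0.798))  # 8.0-2.0 0.4
```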

Some highlights from XCAFRR19. These are retrospective, i.e. after including their results in the (training) data.

Fairest XCAFRR19 matches

| match name | result probability | fairness |
| --- | --- | --- |
| XCAFRR19 R1 FrozenStorm [Discipline]/Past/Peace vs. codexnewb [Future]/Anarchy/Peace, won by codexnewb | 0.50 | 1.00 |
| XCAFRR19 R7 Bomber678 [Feral/Growth]/Disease vs. James MonoBlack, won by James | 0.50 | 1.00 |
| XCAFRR19 R4 Nekoatl [Balance/Growth]/Disease vs. James [Disease/Necromancy]/Law, won by Nekoatl | 0.49 | 0.99 |
| XCAFRR19 R8 Nekoatl [Feral/Growth]/Disease vs. charnel_mouse [Discipline]/Law/Necromancy, won by charnel_mouse | 0.52 | 0.96 |
| XCAFRR19 R3 bolyarich [Feral/Growth]/Disease vs. charnel_mouse [Balance/Growth]/Disease, won by charnel_mouse | 0.52 | 0.95 |
| XCAFRR19 R4 EricF [Fire]/Disease/Truth vs. bolyarich MonoBlack, won by bolyarich | 0.47 | 0.94 |
| XCAFRR19 R5 codexnewb [Future]/Anarchy/Peace vs. OffKilter [Fire]/Growth/Present, won by OffKilter | 0.46 | 0.93 |
| XCAFRR19 R1 UrbanVelvet [Anarchy]/Past/Strength vs. CarpeGuitarrem Nightmare, won by UrbanVelvet | 0.54 | 0.93 |
| XCAFRR19 R2 codexnewb Nightmare vs. dwarddd MonoPurple, won by codexnewb | 0.54 | 0.92 |
| XCAFRR19 R5 Leaky MonoPurple vs. CarpeGuitarrem [Demonology]/Growth/Strength, won by CarpeGuitarrem | 0.55 | 0.91 |
XCAFRR19 upsets

| match name | result probability | fairness |
| --- | --- | --- |
| XCAFRR19 R7 codexnewb MonoPurple vs. zhavier [Anarchy]/Past/Strength, won by codexnewb | 0.36 | 0.73 |
| XCAFRR19 R6 zhavier [Anarchy]/Past/Strength vs. FrozenStorm MonoPurple, won by FrozenStorm | 0.40 | 0.80 |
| XCAFRR19 R3 codexnewb [Future]/Anarchy/Peace vs. Leaky [Discipline]/Disease/Law, won by Leaky | 0.41 | 0.81 |
| XCAFRR19 R5 codexnewb [Future]/Anarchy/Peace vs. OffKilter [Fire]/Growth/Present, won by OffKilter | 0.46 | 0.93 |
| XCAFRR19 R4 EricF [Fire]/Disease/Truth vs. bolyarich MonoBlack, won by bolyarich | 0.47 | 0.94 |
| XCAFRR19 R4 Nekoatl [Balance/Growth]/Disease vs. James [Disease/Necromancy]/Law, won by Nekoatl | 0.49 | 0.99 |
| XCAFRR19 R1 FrozenStorm [Discipline]/Past/Peace vs. codexnewb [Future]/Anarchy/Peace, won by codexnewb | 0.50 | 1.00 |

Here’s something new: the model estimates the variance in player skills and opposed deck components, in terms of their effect on the matchup. This means that I can directly compare the effect different things have on a matchup:

Matchup variance breakdown

The boxes are scaled according to how many elements of that type go into each matchup: 2 players, 1 starter vs. starter, 6 starter vs. spec, 9 spec vs. spec. On average, the players’ skill levels have twice the effect on the matchup that the decks do.

Roughly speaking, that means that if we took the multi-colour decks with the most lopsided matchup in the model, gave the weak deck to an extremely strong player (e.g. EricF), and gave the strong deck to an average-skill player (e.g. codexnewb; average by forum-player standards, remember), we’d expect the matchup to be roughly even.

As always, comments or criticism about the results above are welcome. Let me know if there are particular games you’d like the results for too, although those should be available to view on the new site soon.

I’m in the process of making a small personal website, so I can stick the model results somewhere where I have more presentation options. In particular, I look at the model’s match predictions using DataTables, a JavaScript library that lets you sort and search on different fields (match name, model used, match fairness, etc.), so it would be nice if other people could use them too.

It will also let me make the model’s inner workings more transparent. When I get time, I’m planning to add ways to examine cases of interest, like the Nash equilibrium among a given set of decks, or a way to view the chances of different decks against a given opposing deck and with given players. The latter, in particular, would be useful for letting other people evaluate the model for matches that they’re playing.

I have other versions of the model that use Metalize’s data, but I’m going to delay showing those until I’ve finished putting up the site, and have tidied up the data a bit.


OK, it’s pretty rough, but I now have a first site version up here. I had some grappling with DataTables and Hugo to do, so I should be able to add things more easily after this first version.

Things I’m planning to add first:

  • Results for other versions of the model, to show how it’s improved since the start of the project.
  • A match data CSV, and R/Stan code.
  • Downloadable simulation results data for people to play with, if anyone’s interested.
  • Interactive plots. I’d like to have something to put in players and decks and see the predicted outcome distribution, like vengefulpickle’s Yomi page, instead of having to fiddle with column filters on the DataTable in the Opposed component effects section.
  • Tighter wording, to make things clearer to people who haven’t been following the thread.

I’ll still do update posts here about model improvements, interesting results etc.; the site’s there for people to play around with the model themselves.


Oh, and I’ve developed things to the point where I can easily plug in players/decks in a tournament and get matchup predictions. I’ll post up the predictions for CAWS19 matchups after it’s finished – I don’t want to prime anyone (else) by posting them beforehand – and take a look at how good they were.

Next tournament, my aim is to let the model tell me which deck to use. Not to play to win, more to find the model’s flaws.


The version on the site now has a section for Nash equilibria: against an evenly-matched opponent with double-blind deck picks, how should you best randomise which deck to pick? I’ve currently done this for monocolour decks, and for all multicolour decks used in a recorded game. Doing this for all possible multicolour decks, the way I currently do it, would require several hundred GB of RAM; I’m working on a version that doesn’t.

There are results for the usual case of not knowing who’ll go first / alternating turn order in a series, and for the case where you know before picks whether you’re going first or second.

A lot of the results – which I will try to make more intelligible soon – come down to “the model isn’t very certain about anything.” Go to the site if you want to stare at a bunch of tables. Here’s a few highlights, though:

  • For monocolour games, if you’re going first, half the time you’d take Black or Red. If you’re going second, half the time you’d take Black, or sometimes Purple. If you don’t know, or are alternating turn order, most of the time you’d take Black. All six colours are still present in the picks in all three cases, so there’s still some uncertainty here.
  • For multicolour games with unknown/alternating turn order, the most favoured deck of those recorded is [Necromancy]/Finesse/Strength, but it’s only picked about 4% of the time, so there’s no dominant strategy. In fact, all 105 decks in recorded matches have a non-zero chance of being picked. There’s still a lot of uncertainty here, perhaps made most obvious by the fact that there are two decks with Bashing in the ten highest-weighted decks: [Bashing]/Demonology/Necromancy (#2, 3%) and [Bashing]/Fire/Peace (#7, 2%). Compare to MonoBlack (#21, 1.3%) and Nightmare (#30, 1.1%).
  • In the recorded multicolour case, for unknown/alternating turn order, MonoBlack is the highest-weighted monocolour deck, at 1.3% of total weight.
  • MonoGreen generally has the lowest weight in both monocolour and multicolour. Yes, lower than MonoBlue.
  • The Nash equilibria in both cases are slightly in favour of Player 1, but so slightly that I don’t think it matters much.
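For anyone curious how weights like these can be computed: deck selection here is a two-player zero-sum game over the matrix of win probabilities, and one standard way to approximate its equilibrium is fictitious play, where both sides repeatedly best-respond to the opponent’s empirical mix and the play frequencies converge to an equilibrium (Robinson’s theorem for zero-sum games). This is not my actual implementation, just a minimal Python sketch with made-up payoff matrices:

```python
def nash_weights(payoff, iters=20000):
    """Approximate the row player's equilibrium mixed strategy by fictitious
    play. payoff[i][j] = P(row deck i beats column deck j)."""
    n, m = len(payoff), len(payoff[0])
    row_counts, col_counts = [0] * n, [0] * m
    for _ in range(iters):
        # Row player best-responds to the column player's empirical mix.
        row_vals = [sum(payoff[i][j] * col_counts[j] for j in range(m))
                    for i in range(n)]
        row_counts[max(range(n), key=row_vals.__getitem__)] += 1
        # Column player best-responds, minimising the row player's win chance.
        col_vals = [sum(payoff[i][j] * row_counts[i] for i in range(n))
                    for j in range(m)]
        col_counts[min(range(m), key=col_vals.__getitem__)] += 1
    return [c / iters for c in row_counts]


# A deck that strictly dominates gets all the weight:
dominant = nash_weights([[0.5, 0.6],
                         [0.4, 0.5]])  # -> [1.0, 0.0]

# A rock-paper-scissors-style triangle gets roughly equal weights:
rps = nash_weights([[0.5, 0.7, 0.3],
                    [0.3, 0.5, 0.7],
                    [0.7, 0.3, 0.5]])  # each weight roughly 1/3
```

The sampled equilibria on the site come from doing this sort of calculation over each posterior draw of the matchup matrix, which is where the “the model isn’t very certain about anything” spread comes from.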

I don’t know who was playing Bashing/Fire/Peace, but the #2 deck is being skewed by having a 100% win rate and only getting played in one event.

I know you’re doing something to tease out the player vs deck contributions, but intuitively I would expect a deck with this profile:
Player Average Win % = 55%
Player + Deck Average Win % = 60%
to be a better choice than this profile:
Player Average Win % = 85%
Player + Deck Average Win % = 100%

It should be stressed that none of the decks are particularly strongly favoured, I was just using Bashing’s presence as an example. I could pull out the sampled Nash weights for the #2 deck, and I’d expect them to have a large variance, with the weight often being zero. I expect this for most of the decks, actually: most matchups will have no recorded matches at all, and variability in a small number of them can result in very different Nash equilibria. I especially expect it for any decks containing Bashing, since it’s so rarely used.

Even for decks with a lot of representation, like Nightmare, I’m not sure how helpful looking at their average win rates would be. They’ll still have highly uncertain matchups, and it wouldn’t take many matchups flipping away from a deck’s favour in a sample for it to be dominated by other decks. For example, if you look at the “greedy” Nash equilibria, which just use the mean predicted matchups, most of the decks drop out.

EDIT: Even without everything I mentioned above, from my intuition I’d expect the latter deck in your example to be better, at least in the model, because the player and deck effects are modelled as additive on the log-odds scale. It’s therefore much harder to get from 85% to 100%, or even 90%, than it is to get from 55% to 60%. It’s differences in sample size that would make the former deck more attractive.
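To make the log-odds point concrete, here’s a small Python sketch using the numbers from the example above (`logit` and `inv_logit` are the standard transforms): the deck effect that lifts a 55% baseline to 60% is a fixed shift on the log-odds scale, and applying that same shift to an 85% baseline gains far less than five points.

```python
import math


def logit(p):
    """Win probability -> log-odds."""
    return math.log(p / (1 - p))


def inv_logit(x):
    """Log-odds -> win probability."""
    return 1 / (1 + math.exp(-x))


# Deck effect needed to lift a 55% player baseline to a 60% combined rate:
deck_effect = logit(0.60) - logit(0.55)  # ~0.20 on the log-odds scale

# The same deck effect applied to an 85% baseline gives ~87%, not 90%:
lifted = inv_logit(logit(0.85) + deck_effect)

# A 100% combined rate would correspond to an infinite log-odds shift,
# which no finite deck effect can produce.
```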

EDIT2: The first table on the model page should let you search for a deck’s name to find the matches where it was used. Looks like [Bashing]/Fire/Peace was used by flagrantangles in CAWS18.

EDIT3: Final comment. Having little-played decks highly-weighted is actually a feature I’d expect to see. You can see the same thing happening on the plot for player’s probability of being the best player. Having a high mean skill helps, but so does having a high skill variance. Compare FrozenStorm (low variance, so small probability despite a high mean) to various players like deluks917 (low mean, but high variance, so higher probability).