Codex data thread

Slicing the data 36 ways, how many matches' worth of data does each matchup have?

It would surprise me if there really were so many matchups in the 70-30 & worse range, given the design goals & how long Codex was in dev.

There are 317 recorded matches, but 14 ended in a timeout and I don’t use them. Additionally, 118 of the remaining matches have Bashing or Finesse present. Roughly, that leaves 250 matches’ worth of data relevant to the mono-colour matchups, about 7 matches per matchup. They’re nowhere near evenly distributed, though: Black specs are more popular, for example, and some MMM1 matchups were never played.

As far as the number of lopsided matches goes, there are two things happening:

  1. Amount of data. Obviously, this will improve. Not just the number of matches: as I add more tournaments and players vary their decks, player skills will be better disentangled from deck strengths (not perfectly, since good players also tend to pick good decks). It’ll be a slow improvement for the mono-colour case, because almost no one uses mono-colour decks in tournaments. Time for an MMM revival? There are also some recent-ish tournaments I haven’t added yet, e.g. RACE.
  2. I haven’t yet put enough thought into the prior beliefs for the possible spread of player skills and deck strengths. I want to set them up along the lines of “all/most of the matchups between evenly-skilled players should fall in this range”, and said range should be wider than we’d like it to be for Sirlin’s balance standards, so the model isn’t a priori dismissing the possibility that the game isn’t as balanced as we hope. Imbalance should be surprising, but not impossible. I need to sit down and think through this first, because my intuition for what sort of range to use isn’t very good at the moment (there’s a rough sketch of the kind of calculation I mean below).

Another aspect to the prior is that I might just be underestimating the spread in player skills, which would make the model increase the posterior deck strength spread to compensate. Given I’m only using tournament games at the moment, I was thinking the player skill spread wouldn’t be as high as we’d expect from general players / casual games, but I could be way off-base in thinking that.
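To give an idea of the sort of prior calculation I mean, here’s a rough sketch in R with made-up numbers (these are not the priors the model actually uses): pick a win-probability range that most matchups between evenly-skilled players should fall into, then back out the prior standard deviation that range implies on the log-odds scale.

```r
# Rough prior-width sketch, with made-up numbers rather than the model's actual priors.
# Suppose ~95% of matchups between evenly-skilled players should fall inside
# a chosen P1 win-probability range, symmetric around 50%.
target_range <- c(0.3, 0.7)           # hypothetical "surprising but not impossible" range
logit <- function(p) log(p / (1 - p))

half_width <- logit(target_range[2])  # logit(0.7), about 0.85 on the log-odds scale
prior_sd <- half_width / 2            # ~95% of a normal prior lies within 2 SDs

prior_sd                              # about 0.42
plogis(2 * prior_sd)                  # sanity check: back to roughly 0.7
```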

EDIT: To compare, there are just over 13,000 recorded tournament matches for Yomi right now, across 20 characters with no concept of turn-order, so 210 matchups. That’s about 62 matches per matchup, about 9 times the amount of data per matchup.

2 Likes

When I was running initial statistics, I dropped all the data from the top 3 and bottom 3 players, as otherwise their deck choices were dominating the results (e.g. Bashing went 9-1 in my data because I was the only person to use it, and mono-green did really badly because it had records of (approximately) 2-3, 3-3, 0-3, 0-3, 0-3, where the pile of 0-3 results was from players with few wins with any deck).

I didn’t have enough data points to run a Bayesian analysis and tease out how much “better than personal average” spec choices made for each player.

2 Likes

Sounds reasonable. How many players was that with? Is that up on the forum somewhere?

I had it in a local spreadsheet where I was tracking tournament results (back when I was running the events). I didn’t have enough data to share, so it was just for my edification.

1 Like

EricF’s old player rankings for the 2016 CASS series, which goes further back than my current records, can be found here, if people want to compare.

1 Like

I hadn’t taken a close look at your most recent post until now, @charnel_mouse, but it’s really awesome to see this project’s progress. This is a great contribution to the community.

I wonder if there’d be any interest in a “peer review” games series, where we take the ten “least fair” matchups you listed here, and try playing ten games of just that matchup with a few different players at the helm of each. I agree with @Persephone that some of those matchups don’t jump out at me as intuitively lopsided.

I will say, I think your player skill model is homing in pretty well. Seems like a decent rank order w/ corresponding confidence intervals.

2 Likes

Thanks, @FrozenStorm! Yeah, some of those matchups look far too lopsided, don’t they? I’m a bit happier with their general directions now, though.

I’d be pretty happy if there was some sort of “peer review” series. Sort of a targeted MMM? I’d be up for running / helping out with that.

I had been vaguely thinking of a setup where I take some willing players and give them decks that the model thinks would result in an even matchup, or matchups the model is least certain about. However, I was thinking of doing this over all legal decks, not just monocolour. In the model’s current state, that would probably result in so many lopsided matches that it wouldn’t be fair to the players.

1 Like

A quick update on the changes after CAMS19.

Player skill

Player rank

bansa jumps up the board after coming second with a Blue starter. Persephone enters on the higher end of the board after a strong performance with MonoGreen (I have no match data yet from before her hiatus).

Monocolour matchups

Monocolour matchup details
rank P1 deck P2 deck P1 win probability expected score out of 10 matchup fairness
1 Green Purple 0.493 4.9-5.1 0.99
2 Black Black 0.519 5.2-4.8 0.96
3 Black Purple 0.481 4.8-5.2 0.96
4 Green Red 0.519 5.2-4.8 0.96
5 Blue Blue 0.473 4.7-5.3 0.95
6 Blue Purple 0.473 4.7-5.3 0.95
7 White Blue 0.461 4.6-5.4 0.92
8 Purple Purple 0.548 5.5-4.5 0.90
9 White Black 0.439 4.4-5.6 0.88
10 Blue Red 0.569 5.7-4.3 0.86
11 Green Green 0.431 4.3-5.7 0.86
12 Red Black 0.572 5.7-4.3 0.86
13 Green White 0.581 5.8-4.2 0.84
14 Red White 0.409 4.1-5.9 0.82
15 White White 0.591 5.9-4.1 0.82
16 Red Green 0.610 6.1-3.9 0.78
17 Red Blue 0.615 6.1-3.9 0.77
18 Red Red 0.620 6.2-3.8 0.76
19 Blue Green 0.627 6.3-3.7 0.75
20 Blue White 0.630 6.3-3.7 0.74
21 Purple Red 0.367 3.7-6.3 0.73
22 White Red 0.655 6.6-3.4 0.69
23 Black Red 0.663 6.6-3.4 0.67
24 Purple Black 0.323 3.2-6.8 0.65
25 Black Blue 0.693 6.9-3.1 0.61
26 Purple Blue 0.300 3.0-7.0 0.60
27 Black White 0.709 7.1-2.9 0.58
28 Blue Black 0.289 2.9-7.1 0.58
29 White Purple 0.264 2.6-7.4 0.53
30 Green Black 0.262 2.6-7.4 0.52
31 Black Green 0.745 7.5-2.5 0.51
32 White Green 0.743 7.4-2.6 0.51
33 Green Blue 0.753 7.5-2.5 0.49
34 Purple Green 0.240 2.4-7.6 0.48
35 Red Purple 0.801 8.0-2.0 0.40
36 Purple White 0.824 8.2-1.8 0.35

The main difference here is that P1 MonoPurple is considered to take a beating from P2 MonoGreen (2.4-7.6). This is because the matchup didn’t have much evidence before, and CAMS19 had the following relevant matches:

  • Unity [Balance/Growth]/Present vs. Legion [Anarchy]/Growth/Strength, Legion wins
  • FrozenStorm [Future/Past]/Finesse vs. Unity [Balance/Growth]/Present, Unity wins
  • Unity [Balance/Growth]/Present vs. Persephone MonoGreen, Persephone wins
  • FrozenStorm [Future/Past]/Finesse vs. charnel_mouse [Balance]/Blood/Strength, charnel_mouse wins

All P2 wins.

Current monocolour Nash equilibrium:
P1: Pick Black or Red at 16:19.
P2: Pick Black or White at 6:1.
Value: About 5.5-4.5, slightly P1-favoured.
Slight change in the overall matchup; both players favour Black much less than before.

If the turn order is unknown, then both players have the same equilibrium strategy: pick Black, all the time, since it outperforms all the other monodecks when averaged over turn order.
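If anyone wants to check the equilibrium numbers themselves, here’s a rough sketch in R that approximates it with fictitious play, using the P1 win probabilities from the table above (rows are the P1 deck, columns the P2 deck). This is just a quick reconstruction, not the code the model uses:

```r
# Approximate the monocolour Nash equilibrium by fictitious play,
# using the P1 win probabilities from the matchup table above.
decks <- c("Black", "Blue", "Green", "Purple", "Red", "White")
p1_win <- matrix(c(
  0.519, 0.693, 0.745, 0.481, 0.663, 0.709,  # P1 Black
  0.289, 0.473, 0.627, 0.473, 0.569, 0.630,  # P1 Blue
  0.262, 0.753, 0.431, 0.493, 0.519, 0.581,  # P1 Green
  0.323, 0.300, 0.240, 0.548, 0.367, 0.824,  # P1 Purple
  0.572, 0.615, 0.610, 0.801, 0.620, 0.409,  # P1 Red
  0.439, 0.461, 0.743, 0.264, 0.655, 0.591   # P1 White
), nrow = 6, byrow = TRUE, dimnames = list(P1 = decks, P2 = decks))

# Each player repeatedly best-responds to the other's empirical mixture;
# in a zero-sum game the average strategies converge to an equilibrium.
p1_counts <- p2_counts <- setNames(rep(1, 6), decks)
for (i in seq_len(50000)) {
  br1 <- which.max(p1_win %*% p2_counts)           # P1 best response
  p1_counts[br1] <- p1_counts[br1] + 1
  br2 <- which.min(crossprod(p1_counts, p1_win))   # P2 best response
  p2_counts[br2] <- p2_counts[br2] + 1
}
p1_mix <- p1_counts / sum(p1_counts)   # mostly Black and Red
p2_mix <- p2_counts / sum(p2_counts)   # mostly Black and White
round(p1_mix, 2)
round(p2_mix, 2)
drop(p1_mix %*% p1_win %*% p2_mix)     # game value, roughly 0.55 (about 5.5-4.5)
```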

Next update should be for Metalize’s data.

I’m confused… these aren’t monocolor matchups, so why are they being considered for those monocolor probabilities? They indicate something about the starter choice, perhaps, but certainly not monocolor matchups…

Because the strength of a deck against another is modelled as the sum of 16 strength components:

  • P1 starter vs. P2 starter (1 component, e.g. Green vs. Red)
  • P1 starter vs. P2 specs (3 components, e.g. Green vs. Anarchy, Green vs. Growth, and Green vs. Strength)
  • P1 specs vs. P2 starter (3 components, e.g. Balance vs. Red, Growth vs. Red, Present vs. Red)
  • P1 specs vs. P2 specs (9 components I won’t list here, but which include Present vs. Growth for the first match)

For the example first match, only the Present vs. Growth component is relevant to the MonoPurple vs. MonoGreen matchup, but the second match has nine relevant components: any of Purple starter, Future spec, Past spec, vs. any of Green starter, Balance spec, Growth spec.

This allows a starter/spec’s performance in a deck to also inform its likely performance in other decks. I think this is reasonable, and if we don’t do this, the amount of information on most decks, including monodecks, is minuscule.
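To make that concrete, here’s a toy sketch in R using the first match’s decks from above. The component values and the logistic link are illustrative assumptions only; the real numbers are posterior estimates from the fitted model.

```r
# Toy sketch of the 16-component sum, using the first match's decks from above:
# Unity [Balance/Growth]/Present (Green starter) vs. Legion [Anarchy]/Growth/Strength (Red starter).
d1 <- c("Green", "Balance", "Growth", "Present")   # P1 starter + specs
d2 <- c("Red", "Anarchy", "Growth", "Strength")    # P2 starter + specs

# component_strength["X", "Y"] = strength of P1 element X against P2 element Y,
# on the log-odds scale; in the real model these come from the posterior.
all_names <- union(d1, d2)
component_strength <- matrix(0, length(all_names), length(all_names),
                             dimnames = list(all_names, all_names))
component_strength["Present", "Growth"] <- 0.2     # made-up value
component_strength["Green", "Red"] <- -0.1         # made-up value

# 4 P1 elements x 4 P2 elements = the 16 opposed components
deck_strength <- sum(component_strength[d1, d2])

# Hypothetical player skills, and a logistic link assumed for illustration:
skill_p1 <- 0.3
skill_p2 <- 0.1
plogis(skill_p1 - skill_p2 + deck_strength)        # predicted P1 win probability
```

Because the same element appears in many decks, every recorded match involving, say, Growth on the P2 side nudges all the Growth-related components, which is how non-mono matches end up informing the mono-colour matchups.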

I originally introduced this approach here, albeit not in much detail:

1 Like

I think Frozen is reasonably concerned that, without looking at the inner workings of the actual match, it’s a bit of a stretch to say Present and Growth actually faced off in any meaningful way. That said, such analysis of the inner workings of a match is well beyond the scope of this data compilation.

In a game where Crashbarrows won the day, and the opponent was actively using Strength, the fact that Growth was an alternate option doesn’t directly mean Growth is weak against Crashbarrows. But taken together with the player skill factors, more skilled players will choose what they think is the best strategy against a given deck, and it should all even out in the long run.

In any event, I think the statistics are still informative.

3 Likes

Yeah, I can’t really evaluate whether they faced off without doing something much more complicated. The mere presence of a spec could shape the match without it ever being used, though, and the hero can affect things too. The meaning of the Present vs. Growth component is not how well those specs face off; it’s a partial evaluation of how well decks including them face off. You couldn’t just take the neutral components to evaluate the 1-spec starter game format, for example.

2 Likes

This was a ton of work! Thanks for attempting to do this. Also, I would be interested in testing 10-match mono-color intervals (with advice) to further strengthen match-up assertions.

1 Like

Thanks! I’m going to wait and see how the results for the current tournament affect things; it should reduce the confounding between strong players and strong decks a bit. Then I’ll put the “lopsided” matches, minus Black vs. Blue, up in a thread, and see who’s interested. If you mean advice from other players, I was thinking of suggesting some warm-up matches, à la MMM1, but not enforcing them, so that could be a good time for that sort of thing.

I’m not sure about the exact format yet. @FrozenStorm was your suggestion above to take one matchup at a time, and have different pairs of players play one game each for a total series of ten? If we do it like that, I could put up the lopsided matches and do a poll on which one to do first/next, and it’s a bit less time-demanding than two players doing ten matches.

1 Like

@charnel_mouse re-reading my previous post, I think the format I had in mind was something more akin to MMM’s format, where a sign-up sheet is posted, either to just get a pool of players to be assigned decks, or more exactly like MMM for players to sign up for specific matchups (I think I prefer the latter, as this is meant in my eyes to be an opportunity to “challenge” what’s listed as a “bad matchup” by proving it more even).

So something like:

  1. You compile a list of the 5-10 most “interesting” pieces of your data (perhaps we can have a Discord call w/ some experienced players to sort out which results from the data are most surprising to us, or put up a poll)
  2. Players are then invited to posit a challenge to those data (e.g. you have Purple vs Green as 2-8 and I think it’s much closer to even, so I volunteer to play Purple. Or perhaps your data has Blue vs Red at 6-4 and someone thinks it’s closer to 3-7 the opposite way. I’m making those numbers up but you catch my drift)
  3. Players w/ at least some experience on either deck are invited to help play out a 10-game series of that (I’d bring @Shadow_Night_Black in to help out with Purple maybe, for example, and I think @Mooseknuckles has played a fair share of Green starter at least? And @EricF and @zhavier are just really strong players who have played a fair bit of Growth and could probably puzzle out great plays?). It could be where we play in council with one another, or we just split up who plays how many games on which decks.

A bit sloppy of an idea, I’ll admit, but at least a means of seeing w/ experienced players “is this data reflective of more targeted testing?”

4 Likes

That sounds reasonable. Would it help if I also drew up the model’s matchup predictions when the players are taken into account? Re: Discord calls, if you wanted me on the call too we’d have to work around my being on GMT.

1 Like

While we’re waiting for XCAFRR19 to finish up, I thought I’d put up what the model retroactively considers to be highlights from CAMS19 – the most recent standard tournament – and MMM1. The model currently considers player skills to be about twice as important as the decks in a matchup, so the MMM1 highlights might be helpful to provide context to some of the recent monocolour deck results.

CAMS19 fairest matches

match name result probability fairness
CAMS19 R4 FrozenStorm [Future/Past]/Finesse vs. charnel_mouse [Balance]/Blood/Strength, won by charnel_mouse 0.50 0.99
CAMS19 R3 bansa [Law/Peace]/Finesse vs. zhavier Miracle Grow, won by bansa 0.51 0.98
CAMS19 R10 bansa [Law/Peace]/Finesse vs. zhavier Miracle Grow, won by zhavier 0.49 0.98
CAMS19 R3 codexnewb Nightmare vs. Legion Miracle Grow, won by codexnewb 0.53 0.95
CAMS19 R6 Persephone MonoGreen vs. bansa [Law/Peace]/Finesse, won by bansa 0.47 0.94

CAMS19 upsets

match name result probability fairness
CAMS19 R2 Legion Miracle Grow vs. EricF [Anarchy/Blood]/Demonology, won by Legion 0.39 0.79
CAMS19 R9 EricF [Anarchy/Blood]/Demonology vs. zhavier Miracle Grow, won by zhavier 0.46 0.92
CAMS19 R6 Persephone MonoGreen vs. bansa [Law/Peace]/Finesse, won by bansa 0.47 0.94
CAMS19 R10 bansa [Law/Peace]/Finesse vs. zhavier Miracle Grow, won by zhavier 0.49 0.98
CAMS19 R4 FrozenStorm [Future/Past]/Finesse vs. charnel_mouse [Balance]/Blood/Strength, won by charnel_mouse 0.50 0.99

MMM1 fairest matches

matchup P1 win probability fairness observed P1 win rate
zhavier MonoGreen vs. EricF MonoWhite 0.51 0.99 0.6 (3/5)
FrozenStorm MonoBlue vs. Dreamfire MonoRed 0.49 0.99 0.6 (3/5)
Bob199 MonoBlack vs. FrozenStorm MonoWhite 0.47 0.95 0.6 (3/5)
cstick MonoGreen vs. codexnewb MonoBlack 0.53 0.94 0.6 (3/5)
codexnewb MonoBlack vs. cstick MonoGreen 0.45 0.90 0.4 (2/5)

I’d recommend looking at the P1 Black vs. P2 White matchup, and both Black/Green matchups, in the latest monocolour matchup plot, to see how much the players involved can change a matchup.

MMM1 unfairest matchups

matchup P1 win probability fairness observed P1 win rate
HolyTispon MonoPurple vs. Dreamfire MonoBlue 0.08 0.15 0.0 (0/5)
Nekoatl MonoBlue vs. FrozenStorm MonoBlack 0.09 0.17 0.0 (0/5)
FrozenStorm MonoBlack vs. Nekoatl MonoBlue 0.90 0.20 1.0 (5/5)
Shadow_Night_Black MonoPurple vs. Bob199 MonoWhite 0.87 0.27 1.0 (5/5)
Bob199 MonoWhite vs. Shadow_Night_Black MonoPurple 0.16 0.32 0.0 (0/5)
Dreamfire MonoBlue vs. HolyTispon MonoPurple 0.80 0.40 0.8 (4/5)
EricF MonoWhite vs. zhavier MonoGreen 0.78 0.43 1.0 (5/5)

A quick update after adding the results from XCAFRR19.

The player chart is getting a little crowded with all the new players recently – hooray! – so I’ve trimmed it to only show players that were active in 2018–2019.

Player skill

Player skill is currently not dependent on turn order, time, or decks used.

Monocolour matchups

Row is for P1, column is for P2. Player skills are assumed to be equal.

Monocolour matchup details

Sorted by matchup fairness.

P1 deck P2 deck P1 win probability expected score out of 10 matchup fairness
Green Purple 0.501 5.0-5.0 1.00
Red Black 0.503 5.0-5.0 0.99
Green Red 0.493 4.9-5.1 0.99
White Blue 0.484 4.8-5.2 0.97
Purple Purple 0.522 5.2-4.8 0.96
Blue Red 0.541 5.4-4.6 0.92
Purple Red 0.459 4.6-5.4 0.92
Blue Blue 0.453 4.5-5.5 0.91
White Black 0.443 4.4-5.6 0.89
Black Black 0.426 4.3-5.7 0.85
Blue Purple 0.427 4.3-5.7 0.85
Red Green 0.581 5.8-4.2 0.84
Blue White 0.583 5.8-4.2 0.83
White White 0.584 5.8-4.2 0.83
Red Red 0.595 6.0-4.0 0.81
Green White 0.593 5.9-4.1 0.81
Black Purple 0.602 6.0-4.0 0.80
Green Green 0.378 3.8-6.2 0.76
Red Blue 0.631 6.3-3.7 0.74
Red White 0.367 3.7-6.3 0.74
Blue Green 0.635 6.4-3.6 0.73
Black Blue 0.668 6.7-3.3 0.66
Black Red 0.670 6.7-3.3 0.66
White Red 0.677 6.8-3.2 0.65
Purple Blue 0.308 3.1-6.9 0.62
Blue Black 0.306 3.1-6.9 0.61
Red Purple 0.712 7.1-2.9 0.58
Black White 0.709 7.1-2.9 0.58
Green Blue 0.721 7.2-2.8 0.56
White Green 0.726 7.3-2.7 0.55
Purple Black 0.253 2.5-7.5 0.51
Black Green 0.745 7.5-2.5 0.51
White Purple 0.241 2.4-7.6 0.48
Green Black 0.237 2.4-7.6 0.47
Purple Green 0.203 2.0-8.0 0.41
Purple White 0.798 8.0-2.0 0.40

Some highlights from XCAFRR19. These are retrospective, i.e. after including their results in the (training) data.

Fairest XCAFRR19 matches
match name result probability fairness
XCAFRR19 R1 FrozenStorm [Discipline]/Past/Peace vs. codexnewb [Future]/Anarchy/Peace, won by codexnewb 0.50 1.00
XCAFRR19 R7 Bomber678 [Feral/Growth]/Disease vs. James MonoBlack, won by James 0.50 1.00
XCAFRR19 R4 Nekoatl [Balance/Growth]/Disease vs. James [Disease/Necromancy]/Law, won by Nekoatl 0.49 0.99
XCAFRR19 R8 Nekoatl [Feral/Growth]/Disease vs. charnel_mouse [Discipline]/Law/Necromancy, won by charnel_mouse 0.52 0.96
XCAFRR19 R3 bolyarich [Feral/Growth]/Disease vs. charnel_mouse [Balance/Growth]/Disease, won by charnel_mouse 0.52 0.95
XCAFRR19 R4 EricF [Fire]/Disease/Truth vs. bolyarich MonoBlack, won by bolyarich 0.47 0.94
XCAFRR19 R5 codexnewb [Future]/Anarchy/Peace vs. OffKilter [Fire]/Growth/Present, won by OffKilter 0.46 0.93
XCAFRR19 R1 UrbanVelvet [Anarchy]/Past/Strength vs. CarpeGuitarrem Nightmare, won by UrbanVelvet 0.54 0.93
XCAFRR19 R2 codexnewb Nightmare vs. dwarddd MonoPurple, won by codexnewb 0.54 0.92
XCAFRR19 R5 Leaky MonoPurple vs. CarpeGuitarrem [Demonology]/Growth/Strength, won by CarpeGuitarrem 0.55 0.91
XCAFRR19 upsets
match name result probability fairness
XCAFRR19 R7 codexnewb MonoPurple vs. zhavier [Anarchy]/Past/Strength, won by codexnewb 0.36 0.73
XCAFRR19 R6 zhavier [Anarchy]/Past/Strength vs. FrozenStorm MonoPurple, won by FrozenStorm 0.40 0.80
XCAFRR19 R3 codexnewb [Future]/Anarchy/Peace vs. Leaky [Discipline]/Disease/Law, won by Leaky 0.41 0.81
XCAFRR19 R5 codexnewb [Future]/Anarchy/Peace vs. OffKilter [Fire]/Growth/Present, won by OffKilter 0.46 0.93
XCAFRR19 R4 EricF [Fire]/Disease/Truth vs. bolyarich MonoBlack, won by bolyarich 0.47 0.94
XCAFRR19 R4 Nekoatl [Balance/Growth]/Disease vs. James [Disease/Necromancy]/Law, won by Nekoatl 0.49 0.99
XCAFRR19 R1 FrozenStorm [Discipline]/Past/Peace vs. codexnewb [Future]/Anarchy/Peace, won by codexnewb 0.50 1.00

Here’s something new: the model estimates the variance in player skills and opposed deck components, in terms of their effect on the matchup. This means that I can directly compare the effect different things have on a matchup:

Matchup variance breakdown

The boxes are scaled according to how many elements of that type go into each matchup: 2 players, 1 starter vs. starter, 6 starter vs. spec, 9 spec vs. spec. On average, the players’ skill levels have twice the effect on the matchup that the decks do.

Roughly speaking, that means that, if we took the multi-colour decks with the most lopsided matchup in the model, and gave the weak deck to an extremely strong player, e.g. EricF, and gave the strong deck to an average-skill player, e.g. codexnewb (average by forum-player standards, remember), we’d expect the matchup to be roughly even.
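If it helps to see the arithmetic behind the “twice the effect” claim, here’s the back-of-envelope version (the variances below are placeholders, not the fitted values):

```r
# Back-of-envelope version of the variance comparison, with placeholder variances
# rather than the fitted ones. Each matchup sums 2 player-skill terms and
# 1 + 6 + 9 = 16 opposed-component terms, all treated as independent.
var_player          <- 0.30   # hypothetical per-player skill variance
var_starter_starter <- 0.05   # hypothetical per-component variances
var_starter_spec    <- 0.02
var_spec_spec       <- 0.01

player_contribution <- 2 * var_player
deck_contribution   <- 1 * var_starter_starter +
  6 * var_starter_spec +
  9 * var_spec_spec

player_contribution / deck_contribution   # about 2.3 with these made-up numbers
```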

As always, comments or criticism about the results above are welcome. Let me know if there are particular games you’d like the results for too, although those should be available to view on the new site soon.


I’m in the process of making a small personal website, so I can stick the model results somewhere where I have more presentation format options. In particular, I present the model’s match predictions as JavaScript DataTables, where you can sort and search on different fields (match name, model used, match fairness, etc.), so it would be nice if other people could use them too.

It will also let me more easily make the inner model workings more transparent. When I get time, I’m planning to add ways to examine cases of interest, like the Nash equilibrium among a given set of decks, or a way to view the chances of different decks against a given opposing deck and with given players. The latter, in particular, would be useful to let other people evaluate the model for matches that they’re playing.

I have other versions of the model that use Metalize’s data, but I’m going to delay showing those until I’ve finished putting up the site, and have tidied up the data a bit.

4 Likes

OK, it’s pretty rough, but I now have a first site version up here. I had some grappling with DataTables and Hugo to do, so I should be able to add things more easily after this first version.

Things I’m planning to add first:

  • Results for other versions of the model, to show how it’s improved since the start of the project.
  • A match data CSV, and R/Stan code.
  • Downloadable simulation results data for people to play with, if anyone’s interested.
  • Interactive plots. I’d like to have something to put in players and decks and see the predicted outcome distribution, like vengefulpickle’s Yomi page, instead of having to fiddle with column filters on the DataTable in the Opposed component effects section.
  • Tighter wording, to make things clearer to people who haven’t been following the thread.

I’ll still do update posts here about model improvements, interesting results etc.; the site’s there for people to play around with the model themselves.

2 Likes