Slicing the data 36 ways, how many matches of data does each matchup have?
It would surprise me if there really were so many matchups in the 70-30 & worse range, given the design goals & how long Codex was in dev.
There are 317 recorded matches, but 14 ended in a timeout and I don’t use them. Additionally, 118 of the remaining matches have Bashing or Finesse present. Roughly, that leaves 250 matches’ worth of data relevant to the mono-colour matchups, about 7 matches per matchup. They’re nowhere near evenly distributed, though: Black specs are more popular, for example, and some MMM1 matchups were never played.
As far as the number of lopsided matches goes, there are two things happening:
Another aspect to the prior is that I might just be underestimating the spread in player skills, which would make the model increase the posterior deck-strength spread to compensate. Since I’m only using tournament games at the moment, I was thinking the player skill spread wouldn’t be as high as we’d expect from general players / casual games, but I could be way off-base in thinking that.
EDIT: To compare, there are just over 13,000 recorded tournament matches for Yomi right now, across 20 characters with no concept of turn-order, so 210 matchups. That’s about 62 matches per matchup, about 9 times the amount of data per matchup.
When I was running initial statistics, I dropped all the data from the top 3 and bottom 3 players, as otherwise their deck choices dominated the results (i.e. Bashing went 9-1 in my data because I was the only person to use it, and mono-green did really badly because its records were (approximately) 2-3, 3-3, 0-3, 0-3, 0-3, where the pile of 0-3 results came from players with few wins with any deck).
I didn’t have enough data points to run a Bayesian analysis and tease out how much “better than personal average” spec choices made for each player.
Sounds reasonable. How many players was that with? Is that up on the forum somewhere?
I had it in a local spreadsheet where I was tracking tournament results (back when I was running the events). I didn’t have enough data to share, so it was just for my edification.
EricF’s old player rankings for the 2016 CASS series, which is longer ago than I’ve currently recorded, can be found here, if people want to compare.
I hadn’t taken a close look at your most recent post until now, @charnel_mouse, but it’s really awesome to see this project’s progress. This is a great contribution to the community.
I wonder if there’d be any interest in a “peer review” games series, where we take the ten “least fair” matchups you listed in here, and try out playing ten games of just that matchup with a few different players at the helm of each. I agree with @Persephone that some of those matchups don’t jump out at me as intuitively lopsided.
I will say, I think your player skill model is homing in pretty well. Seems like a decent rank order with appropriate confidence intervals.
Thanks, @FrozenStorm! Yeah, some of those matchups look far too lopsided, don’t they? I’m a bit happier with their general directions now, though.
I’d be pretty happy if there was some sort of “peer review” series. Sort of a targeted MMM? I’d be up for running / helping out with that.
I had been vaguely thinking of a setup where I take some willing players and give them decks that the model thinks would result in an even matchup, or matchups the model was least certain about. However, I was thinking of doing this over all legal decks, not just monocolour. In the model’s current state, this would probably produce so many lopsided matches that it wouldn’t be fair to the players.
A quick update on the changes after CAMS19.
bansa jumps up the board after coming second with a Blue starter. Persephone enters on the higher end of the board after a strong performance with MonoGreen (I have no match data yet from before her hiatus).
rank | P1 deck | P2 deck | P1 win probability | matchup | fairness
---|---|---|---|---|---
1 | Green | Purple | 0.493 | 4.9-5.1 | 0.99 |
2 | Black | Black | 0.519 | 5.2-4.8 | 0.96 |
3 | Black | Purple | 0.481 | 4.8-5.2 | 0.96 |
4 | Green | Red | 0.519 | 5.2-4.8 | 0.96 |
5 | Blue | Blue | 0.473 | 4.7-5.3 | 0.95 |
6 | Blue | Purple | 0.473 | 4.7-5.3 | 0.95 |
7 | White | Blue | 0.461 | 4.6-5.4 | 0.92 |
8 | Purple | Purple | 0.548 | 5.5-4.5 | 0.90 |
9 | White | Black | 0.439 | 4.4-5.6 | 0.88 |
10 | Blue | Red | 0.569 | 5.7-4.3 | 0.86 |
11 | Green | Green | 0.431 | 4.3-5.7 | 0.86 |
12 | Red | Black | 0.572 | 5.7-4.3 | 0.86 |
13 | Green | White | 0.581 | 5.8-4.2 | 0.84 |
14 | Red | White | 0.409 | 4.1-5.9 | 0.82 |
15 | White | White | 0.591 | 5.9-4.1 | 0.82 |
16 | Red | Green | 0.610 | 6.1-3.9 | 0.78 |
17 | Red | Blue | 0.615 | 6.1-3.9 | 0.77 |
18 | Red | Red | 0.620 | 6.2-3.8 | 0.76 |
19 | Blue | Green | 0.627 | 6.3-3.7 | 0.75 |
20 | Blue | White | 0.630 | 6.3-3.7 | 0.74 |
21 | Purple | Red | 0.367 | 3.7-6.3 | 0.73 |
22 | White | Red | 0.655 | 6.6-3.4 | 0.69 |
23 | Black | Red | 0.663 | 6.6-3.4 | 0.67 |
24 | Purple | Black | 0.323 | 3.2-6.8 | 0.65 |
25 | Black | Blue | 0.693 | 6.9-3.1 | 0.61 |
26 | Purple | Blue | 0.300 | 3.0-7.0 | 0.60 |
27 | Black | White | 0.709 | 7.1-2.9 | 0.58 |
28 | Blue | Black | 0.289 | 2.9-7.1 | 0.58 |
29 | White | Purple | 0.264 | 2.6-7.4 | 0.53 |
30 | Green | Black | 0.262 | 2.6-7.4 | 0.52 |
31 | Black | Green | 0.745 | 7.5-2.5 | 0.51 |
32 | White | Green | 0.743 | 7.4-2.6 | 0.51 |
33 | Green | Blue | 0.753 | 7.5-2.5 | 0.49 |
34 | Purple | Green | 0.240 | 2.4-7.6 | 0.48 |
35 | Red | Purple | 0.801 | 8.0-2.0 | 0.40 |
36 | Purple | White | 0.824 | 8.2-1.8 | 0.35 |
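As an aside on reading these tables: the matchup and fairness columns appear to be simple transforms of the P1 win probability. This is my inference from the table values, not the model’s actual code:

```python
def fairness(p1_win_prob):
    """Fairness as it appears in the tables: 1.0 for a coin-flip
    matchup, approaching 0.0 as the result becomes certain."""
    return 2 * min(p1_win_prob, 1 - p1_win_prob)

def matchup_label(p1_win_prob):
    """Render a win probability as a score out of 10, e.g. 0.493 -> '4.9-5.1'."""
    return f"{10 * p1_win_prob:.1f}-{10 * (1 - p1_win_prob):.1f}"

# Reproducing the first and last rows of the table above:
print(matchup_label(0.493), round(fairness(0.493), 2))  # 4.9-5.1 0.99
print(matchup_label(0.824), round(fairness(0.824), 2))  # 8.2-1.8 0.35
```

These two functions reproduce every row I spot-checked, so the fairness score is just twice the underdog’s win probability.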
The main difference here is that P1 MonoPurple is considered to take a beating from P2 MonoGreen (2.4-7.6). This is because the matchup didn’t have much evidence before, and CAMS19 had the following relevant matches:
All P2 wins.
Current monocolour Nash equilibrium:
P1: Pick Black or Red at 16:19.
P2: Pick Black or White at 6:1.
Value: About 5.5-4.5, slightly P1-favoured.
Slight change in overall matchup, both players favour Black much less than before.
If the turn order is unknown, then both players have the same equilibrium strategy: pick Black, all the time, since it outperforms all the other monodecks when averaged over turn order.
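To illustrate how a mixed equilibrium like “Black or Red at 16:19” can arise, here’s a sketch for a 2×2 zero-sum subgame solved by the indifference condition. Restricting to the two quoted supports and treating the table’s win probabilities as payoffs is my simplification; the real equilibrium is computed over all six decks at once:

```python
def mix_2x2(a, b, c, d):
    """Row player's equilibrium mix (p, 1-p) over two rows, for a zero-sum
    game with row-player payoffs [[a, b], [c, d]], assuming an interior mix.
    p makes the column player indifferent: p*a + (1-p)*c == p*b + (1-p)*d."""
    p = (d - c) / ((a - b) + (d - c))
    return p, p * a + (1 - p) * c

# P1 win probabilities from the table above, restricted to the quoted
# supports: rows = P1 Black / P1 Red, columns = P2 Black / P2 White.
p_black, value = mix_2x2(0.519, 0.709, 0.572, 0.409)
```

With those numbers, `p_black` comes out near 0.46 (close to the 16:19 mix, i.e. 16/35 ≈ 0.457) and `value` near 0.55, consistent with the quoted “about 5.5-4.5”.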
Next update should be for Metalize’s data.
I’m confused… these aren’t monocolor matchups, so why are they being considered for those monocolor probabilities? They indicate something about the starter choice, perhaps, but certainly not monocolor matchups…
Because the strength of a deck against another is modelled as the sum of 16 strength components:
For the example first match, only the Present vs. Growth component is relevant to the MonoPurple vs. MonoGreen matchup, but the second match has nine relevant components: any of Purple starter, Future spec, Past spec, vs. any of Green starter, Balance spec, Growth spec.
This allows a starter’s or spec’s performance in one deck to also inform its likely performance in other decks. I think this is reasonable; if we don’t do this, the amount of information on most decks, including monodecks, is minuscule.
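To make the structure concrete, here’s a minimal sketch. Only the “sum of 16 pairwise components” idea comes from the model description above; the logistic link, the skill terms’ placement, and the component values are my illustrative assumptions:

```python
import math
from itertools import product

def win_probability(deck1, deck2, component, skill1=0.0, skill2=0.0):
    """Deck-vs-deck strength as the sum of the 4 x 4 = 16 pairwise
    components between (starter, spec, spec, spec) tuples, plus the
    skill difference, squashed through a logistic link."""
    strength = sum(component.get((a, b), 0.0) for a, b in product(deck1, deck2))
    return 1 / (1 + math.exp(-(strength + skill1 - skill2)))

# Hypothetical component values -- only Present vs. Growth is set here,
# so a MonoPurple vs. MonoGreen estimate leans on that single component.
component = {("Present", "Growth"): 0.4}
mono_purple = ("Purple", "Past", "Present", "Future")
mono_green = ("Green", "Balance", "Feral", "Growth")
print(win_probability(mono_purple, mono_green, component))  # about 0.6
```

With an empty component table and equal skills this returns exactly 0.5, which is the sense in which a never-observed matchup defaults to even.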
I originally introduced this approach here, albeit not in much detail:
I think Frozen is reasonably concerned that, without looking at the inner workings of the actual match, it’s a bit of a stretch to say Present and Growth actually faced off in any meaningful way. That said, such analysis of the inner workings of a match is well beyond the scope of this data compilation.
In a game where Crashbarrows won the day and the opponent was actively using Strength, the fact that Growth was an alternate option doesn’t directly mean Growth is weak against Crashbarrows. But taken together with player skill factors, the more skilled players will choose what they think is the best strategy against a given deck, and it will all even out in the long run.
In any event, I think the statistics are still informative.
Yeah, I can’t really evaluate whether they faced off without doing something much more complicated. The mere presence of a spec could shape the match without it ever being used, though, and the hero can affect things too. The meaning of the Present vs. Growth component is not how well those specs face off, it’s a partial evaluation of how well decks including them face off. You couldn’t just take the neutral components to evaluate the 1-spec starter game format, for example.
This was a ton of work! Thanks for attempting to do this. Also, I would be interested in testing 10-match mono-color intervals (with advice) to further strengthen match-up assertions.
Thanks! I’m going to wait and see how the results for the current tournament affect things, it should reduce the confounding between strong players and strong decks a bit. Then I’ll put the “lopsided” matches, minus Black vs. Blue, up in a thread, and see who’s interested. If you mean advice from other players, I was thinking of suggesting some warm-up matches, à la MMM1, but not enforcing it, so that could be a good time for that sort of thing.
I’m not sure about the exact format yet. @FrozenStorm was your suggestion above to take one matchup at a time, and have different pairs of players play one game each for a total series of ten? If we do it like that, I could put up the lopsided matches and do a poll on which one to do first/next, and it’s a bit less time-demanding than two players doing ten matches.
@charnel_mouse re-reading my previous post, I think the format I had in mind was something more akin to MMM’s format, where a sign-ups sheet is posted, either to just get a pool of players to be assigned decks, or more exactly like MMM for players to sign up for specific matchups (I think I prefer the latter, as this is meant in my eyes to be an opportunity to “challenge” what’s listed as a “bad matchup” by proving it more even.)
So something like:
A bit sloppy of an idea, I’ll admit, but at least a means of seeing w/ experienced players “is this data reflective of more targeted testing?”
That sounds reasonable. Would it help if I also drew up the model’s matchup predictions when the players are taken into account? Re: Discord calls, if you wanted me on the call too we’d have to work around my being on GMT.
While we’re waiting for XCAFRR19 to finish up, I thought I’d put up what the model retroactively considers to be highlights from CAMS19 – the most recent standard tournament – and MMM1. The model currently considers player skills to be about twice as important as the decks in a matchup, so the MMM1 highlights might be helpful to provide context to some of the recent monocolour deck results.
The fairest matches:

match name | result probability | fairness |
---|---|---|
CAMS19 R4 FrozenStorm [Future/Past]/Finesse vs. charnel_mouse [Balance]/Blood/Strength, won by charnel_mouse | 0.50 | 0.99 |
CAMS19 R3 bansa [Law/Peace]/Finesse vs. zhavier Miracle Grow, won by bansa | 0.51 | 0.98 |
CAMS19 R10 bansa [Law/Peace]/Finesse vs. zhavier Miracle Grow, won by zhavier | 0.49 | 0.98 |
CAMS19 R3 codexnewb Nightmare vs. Legion Miracle Grow, won by codexnewb | 0.53 | 0.95 |
CAMS19 R6 Persephone MonoGreen vs. bansa [Law/Peace]/Finesse, won by bansa | 0.47 | 0.94 |
The least expected results:

match name | result probability | fairness |
---|---|---|
CAMS19 R2 Legion Miracle Grow vs. EricF [Anarchy/Blood]/Demonology, won by Legion | 0.39 | 0.79 |
CAMS19 R9 EricF [Anarchy/Blood]/Demonology vs. zhavier Miracle Grow, won by zhavier | 0.46 | 0.92 |
CAMS19 R6 Persephone MonoGreen vs. bansa [Law/Peace]/Finesse, won by bansa | 0.47 | 0.94 |
CAMS19 R10 bansa [Law/Peace]/Finesse vs. zhavier Miracle Grow, won by zhavier | 0.49 | 0.98 |
CAMS19 R4 FrozenStorm [Future/Past]/Finesse vs. charnel_mouse [Balance]/Blood/Strength, won by charnel_mouse | 0.50 | 0.99 |
The fairest MMM1 matchups:

match name | P1 win probability | fairness | observed P1 win rate |
---|---|---|---|
zhavier MonoGreen vs. EricF MonoWhite | 0.51 | 0.99 | 0.6 (3/5) |
FrozenStorm MonoBlue vs. Dreamfire MonoRed | 0.49 | 0.99 | 0.6 (3/5) |
Bob199 MonoBlack vs. FrozenStorm MonoWhite | 0.47 | 0.95 | 0.6 (3/5) |
cstick MonoGreen vs. codexnewb MonoBlack | 0.53 | 0.94 | 0.6 (3/5) |
codexnewb MonoBlack vs. cstick MonoGreen | 0.45 | 0.90 | 0.4 (2/5) |
I’d recommend looking at the P1 Black vs. P2 White matchup, and both Black/Green matchups, in the latest monocolour matchup plot, to see how much the players involved can change a matchup.
The least fair MMM1 matchups:

match name | P1 win probability | fairness | observed P1 win rate |
---|---|---|---|
HolyTispon MonoPurple vs. Dreamfire MonoBlue | 0.08 | 0.15 | 0.0 (0/5) |
Nekoatl MonoBlue vs. FrozenStorm MonoBlack | 0.09 | 0.17 | 0.0 (0/5) |
FrozenStorm MonoBlack vs. Nekoatl MonoBlue | 0.90 | 0.20 | 1.0 (5/5) |
Shadow_Night_Black MonoPurple vs. Bob199 MonoWhite | 0.87 | 0.27 | 1.0 (5/5) |
Bob199 MonoWhite vs. Shadow_Night_Black MonoPurple | 0.16 | 0.32 | 0.0 (0/5) |
Dreamfire MonoBlue vs. HolyTispon MonoPurple | 0.80 | 0.40 | 0.8 (4/5) |
EricF MonoWhite vs. zhavier MonoGreen | 0.78 | 0.43 | 1.0 (5/5) |
A quick update after adding the results from XCAFRR19.
The player chart is getting a little crowded with all the new players recently – hooray! – so I’ve trimmed it to only show players that were active in 2018–2019.
Sorted by matchup fairness.
P1 deck | P2 deck | P1 win probability | matchup | fairness |
---|---|---|---|---
Green | Purple | 0.501 | 5.0-5.0 | 1.00 |
Red | Black | 0.503 | 5.0-5.0 | 0.99 |
Green | Red | 0.493 | 4.9-5.1 | 0.99 |
White | Blue | 0.484 | 4.8-5.2 | 0.97 |
Purple | Purple | 0.522 | 5.2-4.8 | 0.96 |
Blue | Red | 0.541 | 5.4-4.6 | 0.92 |
Purple | Red | 0.459 | 4.6-5.4 | 0.92 |
Blue | Blue | 0.453 | 4.5-5.5 | 0.91 |
White | Black | 0.443 | 4.4-5.6 | 0.89 |
Black | Black | 0.426 | 4.3-5.7 | 0.85 |
Blue | Purple | 0.427 | 4.3-5.7 | 0.85 |
Red | Green | 0.581 | 5.8-4.2 | 0.84 |
Blue | White | 0.583 | 5.8-4.2 | 0.83 |
White | White | 0.584 | 5.8-4.2 | 0.83 |
Red | Red | 0.595 | 6.0-4.0 | 0.81 |
Green | White | 0.593 | 5.9-4.1 | 0.81 |
Black | Purple | 0.602 | 6.0-4.0 | 0.80 |
Green | Green | 0.378 | 3.8-6.2 | 0.76 |
Red | Blue | 0.631 | 6.3-3.7 | 0.74 |
Red | White | 0.367 | 3.7-6.3 | 0.74 |
Blue | Green | 0.635 | 6.4-3.6 | 0.73 |
Black | Blue | 0.668 | 6.7-3.3 | 0.66 |
Black | Red | 0.670 | 6.7-3.3 | 0.66 |
White | Red | 0.677 | 6.8-3.2 | 0.65 |
Purple | Blue | 0.308 | 3.1-6.9 | 0.62 |
Blue | Black | 0.306 | 3.1-6.9 | 0.61 |
Red | Purple | 0.712 | 7.1-2.9 | 0.58 |
Black | White | 0.709 | 7.1-2.9 | 0.58 |
Green | Blue | 0.721 | 7.2-2.8 | 0.56 |
White | Green | 0.726 | 7.3-2.7 | 0.55 |
Purple | Black | 0.253 | 2.5-7.5 | 0.51 |
Black | Green | 0.745 | 7.5-2.5 | 0.51 |
White | Purple | 0.241 | 2.4-7.6 | 0.48 |
Green | Black | 0.237 | 2.4-7.6 | 0.47 |
Purple | Green | 0.203 | 2.0-8.0 | 0.41 |
Purple | White | 0.798 | 8.0-2.0 | 0.40 |
Some highlights from XCAFRR19. These are retrospective, i.e. after including their results in the (training) data.
The fairest matches:

match name | result probability | fairness |
---|---|---|
XCAFRR19 R1 FrozenStorm [Discipline]/Past/Peace vs. codexnewb [Future]/Anarchy/Peace, won by codexnewb | 0.50 | 1.00 |
XCAFRR19 R7 Bomber678 [Feral/Growth]/Disease vs. James MonoBlack, won by James | 0.50 | 1.00 |
XCAFRR19 R4 Nekoatl [Balance/Growth]/Disease vs. James [Disease/Necromancy]/Law, won by Nekoatl | 0.49 | 0.99 |
XCAFRR19 R8 Nekoatl [Feral/Growth]/Disease vs. charnel_mouse [Discipline]/Law/Necromancy, won by charnel_mouse | 0.52 | 0.96 |
XCAFRR19 R3 bolyarich [Feral/Growth]/Disease vs. charnel_mouse [Balance/Growth]/Disease, won by charnel_mouse | 0.52 | 0.95 |
XCAFRR19 R4 EricF [Fire]/Disease/Truth vs. bolyarich MonoBlack, won by bolyarich | 0.47 | 0.94 |
XCAFRR19 R5 codexnewb [Future]/Anarchy/Peace vs. OffKilter [Fire]/Growth/Present, won by OffKilter | 0.46 | 0.93 |
XCAFRR19 R1 UrbanVelvet [Anarchy]/Past/Strength vs. CarpeGuitarrem Nightmare, won by UrbanVelvet | 0.54 | 0.93 |
XCAFRR19 R2 codexnewb Nightmare vs. dwarddd MonoPurple, won by codexnewb | 0.54 | 0.92 |
XCAFRR19 R5 Leaky MonoPurple vs. CarpeGuitarrem [Demonology]/Growth/Strength, won by CarpeGuitarrem | 0.55 | 0.91 |
The least expected results:

match name | result probability | fairness |
---|---|---|
XCAFRR19 R7 codexnewb MonoPurple vs. zhavier [Anarchy]/Past/Strength, won by codexnewb | 0.36 | 0.73 |
XCAFRR19 R6 zhavier [Anarchy]/Past/Strength vs. FrozenStorm MonoPurple, won by FrozenStorm | 0.40 | 0.80 |
XCAFRR19 R3 codexnewb [Future]/Anarchy/Peace vs. Leaky [Discipline]/Disease/Law, won by Leaky | 0.41 | 0.81 |
XCAFRR19 R5 codexnewb [Future]/Anarchy/Peace vs. OffKilter [Fire]/Growth/Present, won by OffKilter | 0.46 | 0.93 |
XCAFRR19 R4 EricF [Fire]/Disease/Truth vs. bolyarich MonoBlack, won by bolyarich | 0.47 | 0.94 |
XCAFRR19 R4 Nekoatl [Balance/Growth]/Disease vs. James [Disease/Necromancy]/Law, won by Nekoatl | 0.49 | 0.99 |
XCAFRR19 R1 FrozenStorm [Discipline]/Past/Peace vs. codexnewb [Future]/Anarchy/Peace, won by codexnewb | 0.50 | 1.00 |
Here’s something new: the model estimates the variance in player skills and opposed deck components, in terms of their effect on the matchup. This means that I can directly compare the effect different things have on a matchup:
The boxes are scaled according to how many elements of that type go into each matchup: 2 players, 1 starter vs. starter, 6 starter vs. spec, 9 spec vs. spec. On average, the players’ skill levels have twice the effect on the matchup that the decks do.
Roughly speaking, that means that, if we took the multi-colour decks with the most lopsided matchup in the model, and gave the weak deck to an extremely strong player, e.g. EricF, and gave the strong deck to an average-skill player, e.g. codexnewb (average by forum-player standards, remember), we’d expect the matchup to be roughly even.
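To make the box scaling concrete: since the components enter the matchup as a sum, their variances add with those multiplicities. The standard deviations below are made-up placeholders (chosen so the ratio comes out near 2), not the fitted values:

```python
import math

# Placeholder per-element standard deviations on the latent strength scale.
sd_player = 1.2
sd_starter_vs_starter = 0.4
sd_starter_vs_spec = 0.25
sd_spec_vs_spec = 0.15

# Each matchup draws 2 player skills, 1 starter-vs-starter component,
# 6 starter-vs-spec components and 9 spec-vs-spec components, so the
# independent variances add with those counts.
var_players = 2 * sd_player**2
var_decks = (1 * sd_starter_vs_starter**2
             + 6 * sd_starter_vs_spec**2
             + 9 * sd_spec_vs_spec**2)

# Ratio of the typical player effect to the typical deck effect on a
# matchup -- about 2 here, by construction of these placeholder values.
effect_ratio = math.sqrt(var_players / var_decks)
```

The point of the scaling is just that a single spec-vs-spec component can have a small spread and the decks can still matter, because nine of them stack up in every matchup.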
As always, comments or criticism about the results above are welcome. Let me know if there are particular games you’d like the results for too, although those should be available to view on the new site soon.
I’m in the process of making a small personal website, so I can stick the model results somewhere where I have more presentation options. In particular, I present the model’s match predictions using the JavaScript DataTables library, where you can sort and search on different fields (match name, model used, match fairness, etc.), so it would be nice if other people could use them too.
It will also make it easier to make the model’s inner workings more transparent. When I get time, I’m planning to add ways to examine cases of interest, like the Nash equilibrium among a given set of decks, or a way to view the chances of different decks against a given opposing deck with given players. The latter, in particular, would be useful for letting other people evaluate the model for matches that they’re playing.
I have other versions of the model that use Metalize’s data, but I’m going to delay showing those until I’ve finished putting up the site, and have tidied up the data a bit.
OK, it’s pretty rough, but I now have a first site version up here. I had to do some grappling with DataTables and Hugo, so I should be able to add things more easily after this first version.
Things I’m planning to add first:
I’ll still do update posts here about model improvements, interesting results etc.; the site’s there for people to play around with the model themselves.