Codex data thread

Yomi’s had historical matchup data for a while, so I thought I’d start putting some together for Codex.

I compiled data for the CAMS 2018 matches at the weekend and started doing some statistical analysis on it. The spreadsheet is here: Codex matches - Google Sheets. Let me know if that link doesn’t work.

I’ll add other tournaments to this when I have the time, but if anyone else has been collecting this stuff, feel free to throw it up here, along with any analysis you’ve been doing.


EDIT: given that there’s been some discussion about rankings on this forum before, e.g. the discussion here on ranking individual specs, I should mention two things about what I’m planning for later analyses.

  • I’m not trying to make objective rankings here. What I would like to do is to get a rough set of rankings, good spec pairings etc., and compare them against general opinions to see where they agree and where they don’t.
  • I’ll be trying to account for interactions, e.g. players being better with certain setups, or certain specs pairing together well. I’d also like some idea of individual matchups, where effectiveness depends on the other player’s specs.

I also threw together a quick analysis; some initial results are below. These should be taken with a large grain of salt:

  1. Currently I only account for general first-player advantage, player skill, and the effect of player order on a player’s skill (i.e. whether they’re better or worse when going first). There’s currently no accounting for decks, so if a player is shown below as being better/worse, that really refers to the performance of that player when using their chosen deck, rather than that player in general.
  2. I’ve only compiled match data from CAMS 2018 so far.

  3. I ignore matches that ended in a timeout.

Once I’ve got data from more tournaments, performance of decks and players can be teased out, and results should get more interesting. Still, hopefully the results below give an idea of what I’ll be doing.

Model details for stats nerds

This is effectively a logistic regression model: player skill for each player, the average effect of player order, and their interaction for each player all have a linear effect on the log odds of winning a match. For each player, the skill effect of going second is the opposite of the effect of going first, so their general skill coefficient sits in the middle, keeping the main effects easy to interpret.
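To make that concrete, here’s what the win probability for a single match looks like under this setup, as a minimal R sketch (the players and coefficient values are made up for illustration, not fitted):

# All quantities are on the log-odds scale.
turn <- 0.2                                # general first-player advantage
skill <- c(alice = 0.8, bob = -0.3)        # per-player general skill
skill_turn <- c(alice = -0.5, bob = 0.1)   # per-player adjustment when going first

# Going first adds the adjustment, going second subtracts it,
# so the general skill coefficient sits between the two.
log_odds <- turn +
  (skill[["alice"]] + skill_turn[["alice"]]) -   # alice goes first
  (skill[["bob"]] - skill_turn[["bob"]])         # bob goes second
plogis(log_odds)                           # probability that alice wins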

Priors: all coefficients have independent standard normal priors.

Even without accounting for chosen specs / starter decks, this model is a pain to analyse exactly, so I ran an MCMC chain for 10,000 iterations. Each proposal added an independent N(0, 1) value to each model coefficient. Convergence was immediate, so no mucking around with burn-in times here. That’s a pretty short chain, but all the marginal coefficient densities look roughly normal, so I reckon it’s good enough. Analysis was done in R; the MCMC code was written from scratch. I’ll switch over to Stan once I’ve added deck effects.
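For the curious, the sampler itself is nothing fancy; a stripped-down sketch of the idea in R, assuming a hypothetical log_post function that returns the log prior plus the log likelihood of the recorded match results for a given coefficient vector:

# Random-walk Metropolis: each proposal adds an independent N(0, 1)
# perturbation to every coefficient, as described above.
run_mcmc <- function(log_post, init, n_iter = 10000) {
  draws <- matrix(NA_real_, nrow = n_iter, ncol = length(init))
  current <- init
  current_lp <- log_post(current)
  for (i in seq_len(n_iter)) {
    proposal <- current + rnorm(length(current))
    proposal_lp <- log_post(proposal)
    # Accept with probability min(1, posterior ratio)
    if (log(runif(1)) < proposal_lp - current_lp) {
      current <- proposal
      current_lp <- proposal_lp
    }
    draws[i, ] <- current
  }
  draws
}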

General first-player advantage

Players going first tended to do better, but not by much. First-player advantage is dwarfed by player effects.

Overall player skill

Player skill matches up pretty closely with final tournament results. The main exception I can see is Bomber. Lower-ranked players have wider skill uncertainty intervals, because they played fewer games.

By-player effect of player order

Ranking high on this one means that you had much better performance as player 1 than as player 2, or vice versa. Probably better to be lower-ranked here. I remember losing every time I was player 1 and winning every time I was player 2, since my codex couldn’t do early aggression, so it seems reasonable I’m ranked highest here.

Player skill levels when going first

Player skill levels when going second

I swear I didn’t set the latter one up on purpose.


Oh man does that really show me as being third? How did that happen?


Very cool!


Maybe the model already hates Vandy, too.


A few more graphs before I take the time to account for spec choices. Again, just for CAMS 2018.
I added in effects for starter decks, given turn order.
I haven’t added effects for players’ proficiencies with their chosen decks; that’ll need data from multiple tournaments. Effectively, this means all players are assumed to be equally proficient with all starter decks.

Summary for player skill levels from last post

Player skill levels, accounting for starter decks

Starter deck power levels, accounting for players

Lettucemode and Carpe come out with higher skill, due to having played MonoGreen and MonoRed; the top-player spot is now closer between zhavier and Dreamfire. Purple does better for the first player, which probably says more about the specs it was paired with than about MonoPurple being better for first player in general.

The Green starter sucks; Black is really good, especially for the second player. Nothing too surprising so far, I think.


If Green sucks, it’s worth noting that no one has used the Blue starter in a tournament that I can remember. I bet someone has, but I just don’t remember.


Well, we’ll see how Green does once I’ve added more tournament data. When I start looking at interaction with spec choices, it might do well if paired with a more aggressive hero/spec.

You could always use the deck you’ve been playing me with.

I would argue my recent random deck has ALMOST got synergy… but it’s still probably not even B-list. Or I’m just bad at it :stuck_out_tongue:

That said, I’m toying with ideas for how to get people to play more with the Blue starter. There has to be some kind of deck that makes the Blue starter at least a little competitive.

I’ve just added match data for CAMS 2017 to the spreadsheet, and will add some more over the next few days. I’ve also added stuff to the model, so I’ll put up a results update soon. In the meantime, go and look at @vengefulpickle’s work here on applying a similar model to Yomi match data, if you haven’t already.


Long post incoming

I’ve now added CAPS 2017. I’m not going to announce every addition, but I thought this one was worth mentioning: with EricF and Shax both using the Blue starter in that tournament, we now have data, however scant, for all of the starter decks! Still nothing for the Bashing spec, though.

Let’s do a proper update.

Model structure

What’s the model currently accounting for?

  • First-player (dis)advantage
  • Player skill, which can be affected by whether they’re player 1 or 2
  • Starter deck strength, affected by player position
  • Spec strength, affected by player position

What’s it not accounting for?

  • Player skill changing over time
  • Synergies, or lack thereof, between specs and starters: all four parts of a deck are considered in isolation
  • Effects on player skill of the opponent faced and of the deck used (e.g. familiarity with their chosen deck)
  • Effect on deck strength of opposing deck

Of these, the starter/spec interactions are probably going to have the biggest effect, so I’ll be adding that next time.

Prior choice: what do model predictions look like before adding data?

The model should make reasonably sane predictions before the data are added, so that we’re not waiting 20 years for enough data for the model to draw useful conclusions. Here’s what the prior distribution for first-turn advantage looks like:

Prior first-player advantage

And here’s what the player skill distributions look like (players are interchangeable here):

Main player effect distributions

Prior distributions for starter and spec strengths are the same as for player skills.
For reference, first-player advantage, player skill, and deck strength are all adjustments to the log-odds of winning. Log-odds map to probabilities roughly like this:

Victory log-odds | Victory probability
-2.20 | 0.1
-1.39 | 0.2
-0.85 | 0.3
-0.41 | 0.4
 0.00 | 0.5
 0.41 | 0.6
 0.85 | 0.7
 1.39 | 0.8
 2.20 | 0.9

Player skills, and the rest, are mostly in the log-odds range from -2 to 2. That’s pretty swingy, but not completely implausible, so these priors aren’t too bad. I’d rather be under- than over-certain.
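If you want to convert between the two yourself, base R’s standard logistic functions do the job; nothing here is specific to my model:

plogis(1.39)   # log-odds to win probability: roughly 0.8
qlogis(0.8)    # win probability to log-odds: roughly 1.39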

I don’t care. Show me the results.

OK, OK.

Turn order

First-turn advantage

Player order makes no discernible difference. Maybe a slight advantage to player 1.

Player skill

Average player skill levels

I have no idea what to make of the player skill predictions, given how many of these players vanished before I got here :frowning: The top few players look about right; I’m not so sure about the rest. A few players are automatically in the middle because they haven’t played enough in these tournaments for the model to get much information on their skill level (e.g. Kaeli, with a single non-timeout match), so that will throw the ordering off what might be expected.

I also have plots giving the probability that each player is the best. This is the one for overall skill; I have plots by turn order too, if people want them. These plots take account of uncertainty where the average-skill plots don’t, so they’re probably more helpful.

Probability of each player being the best

The two highest-ranking players each have only about a 16% chance of being the best, so the rankings could still change dramatically with more data.
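For the record, computing those probabilities from the MCMC output is straightforward; a sketch, assuming a hypothetical draws_player matrix of posterior skill draws (one row per iteration, one column per player):

best_idx <- apply(draws_player, 1, which.max)    # best player in each draw
p_best <- tabulate(best_idx, nbins = ncol(draws_player)) / nrow(draws_player)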

Deck strength

Here are the starter decks:

Overall starter strength


Neutral starter does really badly here.

Starter strength by turn order


Yeah, I’m not too convinced by these at the moment. Bear in mind these are evaluations of the starters independent of the specs they’re paired with, but they still look pretty odd.

Summary plot for starter deck strengths

Now for the specs.

Summary plot for spec strengths

These look like a mess at the moment, too, so I’m going to skip over the other spec plots and move on to matchup predictions.

How much should we trust these results?

Well, you’ve probably looked at the starter and spec results and decided: not very much. I’d like something to evaluate the model with in addition to player opinion, even if player opinion takes priority. Therefore, I’ve asked the model for post-hoc predictions of the outcomes of all the matches. This is cheating a bit, because I’m evaluating the model with the same data I used to fit it, but it should give a rough idea of how it’s doing.
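Mechanically, this just means pushing each recorded match back through the posterior draws; a rough sketch, assuming a hypothetical draws_matchup matrix of posterior draws of each match’s log-odds (as in the matchup vector in the Stan code further down) and the observed outcomes w (1 = first player won):

p_first_win <- colMeans(plogis(draws_matchup))   # P(first player wins), per match
p_outcome <- ifelse(w == 1, p_first_win, 1 - p_first_win)
head(sort(p_outcome), 10)                        # the ten biggest "upsets"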

First of all, we can look at the matches where the model didn’t think a match would go the way it did. Here’s the top 10 “upsets” in the model’s opinion.

Match | Modelled probability of outcome
CAMS 2018: FrozenStorm (Nightmare) vs. zhavier (Miracle Grow), won by zhavier | 0.29
CAPS 2017: zhavier (Miracle Grow) vs. EricF (Peace/Balance/Anarchy), won by EricF | 0.32
CAPS 2017: FrozenStorm (Nightmare) vs. petE (Miracle Grow), won by petE | 0.36
CAMS 2018: cstick (Finesse/Present/Discipline) vs. RathyAro (Nightmare), won by cstick | 0.39
CAPS 2017: Penatronic (Present/Peace/Blood) vs. robinz (Discipline/Fire/Truth), won by robinz | 0.40
CAMS 2018: RathyAro (Nightmare) vs. Nekoatl (Demonology/Strength/Growth), won by RathyAro | 0.44
CAPS 2017: Jadiel (Feral/Future/Truth) vs. EricF (Peace/Balance/Anarchy), won by Jadiel | 0.47
CAMS 2018: zhavier (Miracle Grow) vs. Dreamfire (Demonology/Strength/Growth), won by Dreamfire | 0.47
CAMS 2018: zhavier (Miracle Grow) vs. FrozenStorm (Nightmare), won by zhavier | 0.48
CAMS 2017: Shadow_Night_Black (Feral/Present/Truth) vs. zhavier (Discipline/Present/Anarchy), won by Shadow_Night_Black | 0.49

I haven’t had time to actually read through the matches, so if you have opinions about how unexpected these results are, let me know! It’s worth mentioning that, out of the 108 matches I’ve currently recorded, these are the only 10 where the model put the probability of the actual outcome below 50%. A lot of the rest were thought to be pretty lopsided:

Distribution of post-hoc match predictions

Now, we expect the model to see these as somewhat lopsided: these are the matches used to fit it, so it should be fairly confident about predicting them. So here’s a finer breakdown, roughly comparing the model’s predicted outcomes to how often they actually happened:

Predicted vs. actual player 1 win rate

It looks like the model’s predicted matchups aren’t lopsided enough! The matchups are even more extreme than it thinks they are.
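For anyone wanting to reproduce that comparison, the binning is simple enough; a sketch reusing the hypothetical p_first_win and w objects from the earlier snippet:

# Bin matches by predicted first-player win probability, then compare
# the model's average prediction in each bin against the observed win rate.
bins <- cut(p_first_win, breaks = seq(0, 1, by = 0.1))
cbind(predicted = tapply(p_first_win, bins, mean),
      observed  = tapply(w, bins, mean))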

What’s next?

This is where the model is at right now. It’s got some promise, I think, but at a bare minimum it needs more data, and to account for starter/spec synergies, before its skill/strength results are really reliable.

Current plans

Next I’ll be allowing for starter/spec pairings to have an effect on deck strength, in addition to their individual effects. Hopefully we’ll see decks like Nightmare and Miracle Grow rapidly climb to the top of the charts.

I also need to add more match data. I don’t want to go too far back in time yet, because then the change in player skills over time will become something I need to worry about, so I’ll be adding results from the LDT series first.

Anything I can help with that doesn’t involve understanding the statistics spiel?

The most helpful thing I could get right now is pointers to anything odd in the current results that I haven’t picked up on. I know the deck evaluations are a bit weird. Feedback on the current player ranking would be nice, if it won’t start fights. Most immediately helpful would be thoughts on the top 10 upsets list as currently ranked by the model: do these match results seem particularly surprising in hindsight? Would you expect them to go the same way if they were played again?

Stop filling the forums up with so many images

If that’s a problem, I can just put the model up on GitHub, and occasionally bump the thread when I do a big update. Unless I get told otherwise, I’ll just post images for now.

Show me the stats!

Stan model code
data {
  int<lower=0> M;                          // number of matches
  int<lower=0> P;                          // number of players
  int<lower=0> St;                         // number of starter decks
  int<lower=0> Sp;                         // number of specs
  int<lower=1> first_player[M];            // ID number of first player
  int<lower=1> second_player[M];           // ID number of second player
  int<lower=1> first_starter[M];           // ID number of first starter deck
  int<lower=1> second_starter[M];          // ID number of second starter deck
  int<lower=1> first_specs1[M];            // ID number for first player's first spec
  int<lower=1> first_specs2[M];            // ID number for first player's second spec
  int<lower=1> first_specs3[M];            // ID number for first player's third spec
  int<lower=1> second_specs1[M];           // ID number for second player's first spec
  int<lower=1> second_specs2[M];           // ID number for second player's second spec
  int<lower=1> second_specs3[M];           // ID number for second player's third spec
  int<lower=0, upper=1> w[M];              // 1 = first player wins, 0 = second player wins
}
parameters {
  real turn;                               // first-player advantage in log odds
  vector[P] player_std;                    // player skill levels in log odds effect
  vector[P] player_turn_std;               // player skill level adjustment for going first (penalty if second)
  vector[St] starter_std;                  // starter deck strengths
  vector[St] starter_turn_std;             // starter deck strength adjustment for going first (penalty if second)
  vector[Sp] spec_std;                     // spec strength
  vector[Sp] spec_turn_std;                // spec strength adjustment for going first (penalty if second)
  real lsd_player;                         // player skill log spread
  real lsd_player_turn;                    // player skill turn adjustment log spread
  real lsd_starter;                        // starter deck strength log spread
  real lsd_starter_turn;                   // starter deck strength turn adjustment log spread
  real lsd_spec;                           // spec strength log spread
  real lsd_spec_turn;                      // spec strength turn adjustment log spread
}
transformed parameters{
  vector[M] matchup;                                      // log-odds of a first-player win for each match
  real<lower=0> sd_player = exp(lsd_player);
  real<lower=0> sd_player_turn = exp(lsd_player_turn);
  real<lower=0> sd_starter = exp(lsd_starter);
  real<lower=0> sd_starter_turn = exp(lsd_starter_turn);
  real<lower=0> sd_spec = exp(lsd_spec);
  real<lower=0> sd_spec_turn = exp(lsd_spec_turn);
  vector[P] player = sd_player * player_std;
  vector[P] player_turn = sd_player_turn * player_turn_std;
  vector[St] starter = sd_starter * starter_std;
  vector[St] starter_turn = sd_starter_turn * starter_turn_std;
  vector[Sp] spec = sd_spec * spec_std;
  vector[Sp] spec_turn = sd_spec_turn * spec_turn_std;
  matchup = turn +
    player[first_player] + player_turn[first_player] - player[second_player] + player_turn[second_player] +
    starter[first_starter] - starter[second_starter] + starter_turn[first_starter] + starter_turn[second_starter] +
    spec[first_specs1] - spec[second_specs1] + spec_turn[first_specs1] + spec_turn[second_specs1] +
    spec[first_specs2] - spec[second_specs2] + spec_turn[first_specs2] + spec_turn[second_specs2] +
    spec[first_specs3] - spec[second_specs3] + spec_turn[first_specs3] + spec_turn[second_specs3];
}
model {
  lsd_player ~ normal(0, 0.1);
  lsd_player_turn ~ normal(0, 0.1);
  lsd_starter ~ normal(0, 0.1);
  lsd_starter_turn ~ normal(0, 0.1);
  lsd_spec ~ normal(0, 0.1);
  lsd_spec_turn ~ normal(0, 0.1);
  turn ~ std_normal();
  player_std ~ std_normal();
  player_turn_std ~ std_normal();
  starter_std ~ std_normal();
  starter_turn_std ~ std_normal();
  spec_std ~ std_normal();
  spec_turn_std ~ std_normal();
  w ~ bernoulli_logit(matchup);
}

What's this "partial pooling" you're referring to in the plot subtitles?

Partial pooling is a statistical technique often used by Stan users and readers of Andrew Gelman’s work; it’s a type of hierarchical modelling.

As an example, I previously had all the player skill levels modelled as independent. That gives reasonable results, but partial pooling lets me control how large the levels can get, and also introduces some dependency.

Specifically, whereas before each skill level had an independent Normal(0, 1) distribution, they now all have an independent Normal(0, sd) distribution, where sd is an additional unknown parameter for the model to do inference on (these are the lsd_... and sd_... variables in the Stan model code above). So, if a player’s skill is considered to be large, then sd will be pushed larger, and the other players’ skill levels will tend to spread out a bit more too. Statistically, this has a “shrinkage” effect that stops any of the estimated skill/strength levels from becoming infeasibly large at the expense of everything else.

This is useful for two other reasons. Firstly, the use of an sd variable means that first-player advantage, player skill, starter strength etc. can now have different average effect sizes. If the sd for player skill tends to be larger, that means player skill tends to have a larger effect on the matchup than first-turn advantage does. Secondly, the sd for e.g. player skill describes the general population that the modelled players’ skill levels are drawn from. This means that inference on sd translates into inference on what we expect the skill levels of non-modelled players to look like. Has a new player just appeared? sd gives an a priori idea of what their skill level might look like.
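That last point is easy to see by simulation; a quick R sketch of what the priors alone imply for an unseen player’s skill, using the same normal(0, 0.1) prior on the log spread as the Stan code above (with a fitted model, you’d plug in posterior draws of sd instead):

sd_player <- exp(rnorm(10000, mean = 0, sd = 0.1))   # draws of the skill spread
new_skill <- rnorm(10000, mean = 0, sd = sd_player)  # skill of a hypothetical new player
quantile(new_skill, c(0.025, 0.5, 0.975))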

It’s called partial pooling in comparison to, e.g., modelling all player skill levels as exactly the same (complete pooling), or treating them all independently (no pooling).


I don’t think I’m in the running for fourth best player, so something feels strange there.

Well, the only tournament data I’ve got for you at the moment is CAMS this year, and you did well there. That’s the only data for me at the moment, too; once I’ve added XCAFS, I expect my rating to tank.


This is super interesting stuff @charnel_mouse, I appreciate the obvious abundance of effort you’re putting in here!

I think labeling the player graph based on “skill” feels a bit weird though; I feel like perhaps it’s weighting results too high without accounting for deck strength (or perhaps as you pointed out, there’s just too little data). Hobusu, Nekoatl and codexnewb in particular feel very out of place on that chart near the bottom (they are plenty skilled), and I am definitely not a better player than Zhav or EricF (and probably Dreamfire).

Nevertheless, this is super cool stuff, thanks for sharing!


Thanks @FrozenStorm! Good to hear people are finding this interesting, and not being put off by the post length.

I agree those players are ranked pretty low at the moment. That’s partly the number of matches, but it’s also because I’ve only got three tournaments’ worth. That means each player has been observed with at most three different decks, usually fewer, so the model is struggling to disentangle player skill from deck strength. I think “skill” is the correct term, because the model is at least trying to account for deck strength and turn advantage, just not well enough yet.


The work you are doing here is amazing.

How big a data sample would your model need to be meaningful?

I could try to scrounge up a local history of games from reports, and provide maybe 30-50 games between a few regulars.

There are exactly zero players from the local scene who also play on the forum so far, so no possibility for cross-referencing, sadly.

But it could still be used as context for comparing decks.

Blue gets much more of the spotlight here, for instance.


Hey, thanks. I honestly have no idea how much more data it needs. It definitely needs more deck variety per player, but I can’t put any numbers to that.

If you want to add games to the sheet, please do! I’m going to change the format at some point, but if you put “Offline play” or something in the tournament column, it should be good!


Can you change the format ASAP? I’d suggest something like
start_year tournament round player1 player2 victor victory starter spec1 spec2 spec3

Decoupling deck from event seems necessary. It would allow recording casual matches.

Edit: I think I’ll go ahead and start filling out the file with this format, and you can adapt to it when convenient, unless you’d have it differently.

Edit2: Wait, no, I can’t record both players’ decks in the same row that way. Damn.

Edit3: I guess we have to implement a Match ID keyfield of sorts? Or stick to one match = one row and add four extra fields for the second deck?

Yeah, go for it. I’m working off a local version of the sheet rather than the online version, so it doesn’t screw up anything I’m doing.

If it’s easier, enter the decks in the forum’s format, e.g. [Demonology/Necromancy]/Finesse. I think I’m going to have a separate sheet that R can use as a key to derive the specs and starter from that. Then just have a deck1 and a deck2 column for each match, so it’s all on a single row.
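For illustration, the R side could look something like this rough sketch; parse_deck and the starter_key lookup are hypothetical names, with the key itself to be filled in from that separate sheet:

# Split a forum-format deck string into its three specs, and look the
# starter up in a key table keyed on the full deck string.
parse_deck <- function(deck, starter_key) {
  specs <- strsplit(gsub("\\[|\\]", "", deck), "/")[[1]]
  list(starter = starter_key[[deck]], specs = specs)
}

starter_key <- list("[Demonology/Necromancy]/Finesse" = "Black")  # hypothetical entry
parse_deck("[Demonology/Necromancy]/Finesse", starter_key)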

I hadn’t gotten as far as thinking about keyfields, to be honest. I’ve only been looking at tournament matches, so it hasn’t come up. If you think it helps, go for it.


@charnel_mouse I could help out with entering the MMM data for mono-color results, if that’s at all worth adding.
