BGonline.org Forums
How to design a statistically significant bot head-to-head?
Posted By: Timothy Chow In Response To: How to design a statistically significant bot head-to-head? (Ian Shaw)
Date: Friday, 8 January 2010, at 10:18 p.m.
Ian Shaw wrote:
If I want to play off two bots in a head-to-head contest, how do I choose the right number of games to play?
If you don't have any idea what the strength difference is between the bots, then there isn't any way to pick the "right" number of games to play in advance. The simplest option is just to pick some large number of games, play them out, and see what you get.
You are not supposed to keep extending the test until one bot is far enough in front.
Actually, this is not out of the question, but the probabilities involved with this approach are trickier to compute and easy to get wrong, so I wouldn't recommend it.
Is the design of the test affected if I have a benchmark result which implies that Bot A is x ppg better than Bot B?
Yes. For example, if you stick to your original plan of testing the null hypothesis that "the bots are equal," then this information (if correct) will give you some sense of how many games you want to play. Suppose, to simplify the math, you play DMP matches rather than cubeless money games, and you expect that the probability that Bot A will beat Bot B in a single game is p. Then if you play N games, the expected number of Bot-A wins is pN, and the standard deviation is sqrt(Np(1-p)), so if you pick N so that, say, (p - 0.5)N = 4sqrt(Np(1-p)), then with high probability you will get a statistically significant result. Solving, we get N = 16p(1-p)/(p - 0.5)^2 or about 4/(p - 0.5)^2. For cubeless money games you may need somewhat more trials because the gammons will raise the variance, but this is just a rough-and-ready calculation anyway so I wouldn't worry too much about that.
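To make that concrete, here is a small Python sketch of the same arithmetic (the 53% single-game edge is purely an illustrative assumption, not a claim about any actual bots):

    import math

    def games_needed(p, sigmas=4.0):
        """Number of DMP games N satisfying (p - 0.5) * N = sigmas * sqrt(N * p * (1 - p)),
        i.e. the expected surplus of Bot-A wins is `sigmas` standard deviations."""
        return (sigmas ** 2) * p * (1 - p) / (p - 0.5) ** 2

    print(round(games_needed(0.53)))   # roughly 4,400 games for a 53% single-game edge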
The other thing you can do is that instead of just picking "the bots are equal" as your null hypothesis, you can set up two hypotheses and try to decide between them. The first hypothesis is that the bots are equal, and the second hypothesis is that Bot A is better than Bot B by x. Then each trial you run will give you evidence for one hypothesis or the other and you can keep accumulating evidence until, say, the likelihood of one hypothesis is 100 times that of the other hypothesis. Typically, you'll need fewer trials to reach this conclusion than I calculated above, because you're not trying to reject the null hypothesis "absolutely" but are just comparing its likelihood with the alternative hypothesis and trying to decide which one fits the evidence better.
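For concreteness, here is a rough sketch in Python of that likelihood-ratio bookkeeping (this is just my own illustrative framing; play_game, the 100-to-1 threshold, and the 53% figure are all assumptions for the example):

    import math
    import random

    def likelihood_ratio_test(play_game, p1, threshold=100.0, max_games=100_000):
        """Accumulate the likelihood ratio of H1 ("Bot A wins a DMP game with
        probability p1") against H0 ("the bots are equal, p = 0.5"), stopping
        once one hypothesis is `threshold` times as likely as the other."""
        log_ratio = 0.0
        for n in range(1, max_games + 1):
            win = play_game()  # True if Bot A wins this game
            log_ratio += math.log(p1 / 0.5) if win else math.log((1 - p1) / 0.5)
            if log_ratio >= math.log(threshold):
                return "favour H1", n, math.exp(log_ratio)
            if log_ratio <= -math.log(threshold):
                return "favour H0", n, math.exp(log_ratio)
        return "undecided", max_games, math.exp(log_ratio)

    # Toy stand-in for an actual bot-vs-bot game: Bot A "really" wins 53% of games.
    print(likelihood_ratio_test(lambda: random.random() < 0.53, p1=0.53))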
In the end, of course, I want to know which bot is better. Will the test to see whether they are equal answer this question?
Yes. Common sense prevails here; if you've played enough games to get a statistically significant result, then the bot that comes out on top is the better one.
By the way, there's another approach, which is to regard what you're doing not as rejecting the hypothesis that the bots are equal, but as estimating the difference in strength between the bots. The limitation of the method you suggested is that at the end, strictly speaking, all you've established is the bare statement that "Bot A is better than Bot B." But typically, we're also interested in how much better Bot A is. If you view your experiment not as a problem in decision theory but as a problem in estimation theory, then you don't need to worry so much about how many trials to run. Just run as many trials as you have the resources for, and combine them to get the best estimate you can from the data you have. This is actually the approach I would recommend.
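As a sketch of what I mean, assuming you have the per-game results from Bot A's point of view (the numbers below are made up), you just report the mean and a rough 95% interval:

    import math

    def estimate_ppg(results):
        """Mean points-per-game for Bot A, with an approximate 95% confidence interval."""
        n = len(results)
        mean = sum(results) / n
        var = sum((r - mean) ** 2 for r in results) / (n - 1)   # sample variance
        se = math.sqrt(var / n)                                  # standard error of the mean
        return mean, (mean - 1.96 * se, mean + 1.96 * se)

    results = [1, -1, 2, 1, -2, 1, -1, 1]   # illustrative cubeless money results
    ppg, (lo, hi) = estimate_ppg(results)
    print(f"Bot A is about {ppg:+.3f} ppg better, 95% CI [{lo:+.3f}, {hi:+.3f}]")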
For speed, can I run a shorter test simultaneously on several processors, and combine the results by summing them, weighted by the number of games completed, of course?
Yes.
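For example, in Python (the run sizes and per-run averages below are invented for illustration), the combination is just a weighted average, which is the same as pooling all the games into one sample:

    def combine_runs(runs):
        """runs: list of (games_played, mean_ppg) from each processor."""
        total_games = sum(n for n, _ in runs)
        return sum(n * ppg for n, ppg in runs) / total_games

    runs = [(500, 0.021), (750, 0.035), (600, 0.018)]
    print(f"Combined estimate: {combine_runs(runs):+.4f} ppg over {sum(n for n, _ in runs)} games")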