
BGonline.org Forums
Using Noisy Evaluations to Measure Performance
Posted By: Chris Yep
Date: Thursday, 31 March 2016, at 5:36 a.m.
In Response To: Winning PR vs Losing PRone thing impossible to measure (Phil Simborg)
In my opinion a flaw in typical error rate / performance rate measurements (Snowie, GNU, XG) is that every move counts as zero "decisions" or one "decision" with no middle ground. The ER/PR measurement doesn't account for the fact that some positions are easy (e.g. most bearoff checker plays), while other positions are difficult. It also doesn't account for the fact that some positions are unimportant; e.g. it may be difficult to determine the best checker play, but every legal move may be within .005 of the best move.
XG tries to account for unimportant moves by excluding forced moves and "irrelevant" moves (defined as moves where all legal plays are within .001 of the best play). Similarly, GNU excludes forced moves. However, I think more can be done in this area.
A New Performance Rating Measurement:
What I'd like to see is a comparison of a player's errors to the average errors of 100 intermediate or advanced players. Ideally these 100 players would be humans, but since that's not realistically possible a bot could simulate humans by using evaluations with added noise. In fact GNU and XG are already setup to do this. For example, the GNU player "Advanced" gets the evaluation of any play by starting with its standard 0ply evaluation and then adding noise (a normal distribution with mean 0 and standard deviation 0.015). The GNU player "Intermediate" uses a similar algorithm (with standard deviation 0.040). XG also has a similar algorithm for its lowerlevel players (Advanced, Intermediate, Beginner, and Distracted).
A possible algorithm for a new PR:
1. Choose two constants:
N = standard deviation of noise; it shouldn't be too small or too large; my guess is that it's better to err on the high side; maybe use something in the range of 0.04 to 0.08.
Note that to better mimic humans, N should be smaller in certain types of positions (e.g. bearoff checker plays) and larger in other types of positions (e.g. complex prime vs. prime checker plays). Also, N should probably be larger for cube decisions. But for simplicity, just assume a constant N for all checker plays and all cube decisions.
C = a multiplicative constant calibrated so that the average number of "decisions" (see below) is about the same as under XG's current PR calculation; this ensures that average PRs are about the same as under XG's old PR method.
2. Define 100 "noisy" players using the above method (e.g. XG 1ply with noise or GNU 0ply with noise). E.g. if N = 0.04 use 100 GNU "Intermediate" players. Of course each noisy player should use a different random seed/counter. The choice of random seeds/counters can be done in a systematic ("deterministic") way so that the overall result can be duplicated on other computers.
For every position analyzed, feed the position to the 100 noisy players. Keep track of each of their (top) move choices. Let E = the average error of these 100 (top) move choices. Then for the purpose of the new PR, this position counts as E*C "decisions." Note that C is chosen (calibrated) by studying a large sample of human matches (live or online matches). C should be chosen so that the number of decisions under the old PR method is approximately equal to the number of decisions under the new PR method.
Some examples:
(a) Forced moves: Since the average error of these 100 noisy players is zero on forced moves, forced moves don't count as decisions.
(b) "Irrelevant moves" (moves where all legal plays have approximately the same equity): Since the average error of these 100 noisy players is very small, these moves count as only a small fraction of a decision.
(c) Moves with a clear choice: If the best play is much better than the secondbest play, then most of the time all 100 noisy players will choose the best play, so most of the time this won't count as a decision.
(d) Moves where the average error (of the 100 noisy players) is high: These count as more than a full decision. I think this is reasonable as these positions are "difficult" according to the 100 noisy players.
Overall thoughts on this algorithm?
Some possible criticisms:
1. A noisy player with added noise doesn't perfectly model a human. (I agree, but I think it's a step in the right direction.)
2. The algorithm may be computationally expensive. (I doubt it, as it's restricted to 0ply [XG 1ply] evaluations plus noise. In any case, the algorithm could define 30 noisy players for lowlevel analysis (with lowlevel settings), 100 noisy players for mediumlevel analysis, and 1000 noisy players for highlevel analysis.)
I would love to see this algorithm implemented by a major backgammon program. My guess is that it's a more "accurate" (reliable) measure of performance than GNU ER and XG PR.

BGonline.org Forums is maintained by Stick with WebBBS 5.12.