Revised Suggested Grading Benchmarks (PR and Elo)

Posted By: Rick Janowski
Date: Sunday, 5 October 2014, at 12:39 p.m.

In Response To: Suggested Idea for Performance Grading (Rick Janowski)

I summarise below the revised suggested PR benchmarks for the four general levels: Grandmaster (G-1 to G-3), Master (M-1 to M-3), Advanced (A-1 to A-3) and Intermediate (I-1 to I-3). I also provide a viable set of commensurate Elo ratings, with the rating for G-3 arbitrarily set at 2000. This anchor could vary from system to system, but the rating differences between grades should generally not vary. I think these values are close to long-term average ratings on GridGammon.

G-3 PR <= 3.00 (Elo >= 2000)

G-2 PR <= 3.50 (Elo >= 1983)

G-1 PR <= 4.00 (Elo >= 1967)

M-3 PR <= 4.60 (Elo >= 1947)

M-2 PR <= 5.30 (Elo >= 1923)

M-1 PR <= 6.00 (Elo >= 1900)

A-3 PR <= 7.00 (Elo >= 1867)

A-2 PR <= 8.50 (Elo >= 1817)

A-1 PR <= 10.00 (Elo >= 1767)

I-1 PR <= 12.00 (Elo >= 1700)

I-2 PR <= 14.50 (Elo >= 1617)

I-3 PR <= 17.50 (Elo >= 1517)
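
For anyone wanting to apply the list above mechanically, here is a minimal Python sketch of a grade lookup. The function names (grade_for_pr, elo_from_pr) are my own illustration, not part of any existing tool. The linear PR-to-Elo conversion is only an observation drawn from the listed numbers, which all agree with a slope of 100 Elo points per 3.00 PR points after rounding. It is not a formula stated in the proposal.

```python
# Benchmarks copied from the list above: (grade, maximum PR, minimum Elo).
# Grade labels and their order are kept exactly as listed.
BENCHMARKS = [
    ("G-3", 3.00, 2000),
    ("G-2", 3.50, 1983),
    ("G-1", 4.00, 1967),
    ("M-3", 4.60, 1947),
    ("M-2", 5.30, 1923),
    ("M-1", 6.00, 1900),
    ("A-3", 7.00, 1867),
    ("A-2", 8.50, 1817),
    ("A-1", 10.00, 1767),
    ("I-1", 12.00, 1700),
    ("I-2", 14.50, 1617),
    ("I-3", 17.50, 1517),
]


def grade_for_pr(pr):
    """Return the best grade whose PR ceiling is met, or None if none is."""
    for grade, max_pr, _min_elo in BENCHMARKS:
        if pr <= max_pr:
            return grade
    return None


def elo_from_pr(pr, anchor_pr=3.00, anchor_elo=2000.0):
    """Assumed linear conversion inferred from the table (not from the post):
    100 Elo points per 3.00 PR points, anchored at PR 3.00 = Elo 2000."""
    return anchor_elo - (pr - anchor_pr) * 100.0 / 3.0


def table_matches_linear_fit():
    """Check that every listed Elo equals the rounded linear estimate."""
    return all(round(elo_from_pr(max_pr)) == min_elo
               for _grade, max_pr, min_elo in BENCHMARKS)


if __name__ == "__main__":
    print(grade_for_pr(4.3))            # a PR of 4.3 falls in the M-3 band
    print(table_matches_linear_fit())
```

Because the anchor is arbitrary, moving anchor_elo shifts every grade by the same amount and leaves the differences between grades unchanged.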

Note the very small difference in Elo ratings between all the grandmaster and master grades (all within a 100-point Elo band). I believe this demonstrates why normal Elo rating systems, as originally applied in Kent Goulding’s Database and on the Backgammon Servers, are unreliable in distinguishing between the top grades. Typically, on the Backgammon Servers, any instantaneous rating will vary by plus or minus 100 rating points over, say, a 2-year period. The VRR Elo system proposed by Bob Koca might well reduce variance to acceptable levels, but it may need a very high number of matches (all recorded, transcribed and analysed) before reliable ratings are obtained. This might be just too impractical. I hope Bob can clarify this area, as I am guessing to some extent. Perhaps both the PR approach and the VRR Elo approach (once reliable for any particular player) could be run in parallel, each individually giving an acceptable basis for the award of the various grades.

Other bots which measure error rate (or PR) could be used instead of XG by recalibrating the benchmark PR values. This should not be too difficult. Moreover, any newly designed bot, or improved version of an existing bot, can be readily calibrated. Essentially, XG2 values are listed here because XG2 is generally acknowledged to be the most reliable tool currently available for measuring error rate. It would be sensible to adopt new tools in the future, provided they offer a significant improvement and the benchmarks are re-calibrated.

The PR measured for an attempt at one of the grades should be taken from a set of pre-designated matches satisfying grade-specific requirements for minimum match length and sample size. Once matches have been used in an attempt they cannot be reused in any further attempt, unless that attempt was successful and the player wishes to use all of them in a new attempt for a higher grade requiring a larger sample size. There should be some waiting period before a failed attempt at a grade is re-attempted, perhaps 3 months.
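
The reuse and waiting-period rules could be checked along the following lines. This is a rough sketch under my own assumptions: the function name can_attempt and its parameters are hypothetical, "3 months" is approximated as 90 days, and the grade-specific minimum match length and sample size are left out because no values have been proposed for them.

```python
from datetime import date, timedelta

# Assumption: the suggested "perhaps 3 months" is approximated as 90 days.
WAITING_PERIOD = timedelta(days=90)


def can_attempt(match_ids, used_ids, last_failed_on, today, prior_success=None):
    """Hypothetical eligibility check for one attempt at one grade.

    match_ids      -- matches pre-designated for this attempt
    used_ids       -- matches already used in any earlier attempt
    last_failed_on -- date of the last failed attempt at this grade, or None
    prior_success  -- matches of an earlier successful attempt that the
                      player wants to carry into this higher-grade attempt
    """
    # Waiting period after a failed attempt at the same grade.
    if last_failed_on is not None and today - last_failed_on < WAITING_PERIOD:
        return False

    reused = set(match_ids) & set(used_ids)
    if not reused:
        return True  # every match is fresh

    # Reuse is allowed only when ALL matches of a successful attempt are
    # carried over, and nothing else is reused.
    return prior_success is not None and reused == set(prior_success)


if __name__ == "__main__":
    today = date(2014, 10, 5)
    print(can_attempt({"m1", "m2"}, set(), None, today))
    print(can_attempt({"m1", "m2"}, {"m1"}, None, today))
```

Passing prior_success covers the case where a successful set is extended with further matches to reach the larger sample size of a higher grade.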

