A series of simple questions that nonetheless pose serious challenges to LLMs, unlike common mathematical benchmarks, which are many times more difficult for humans. These questions are meant to test models' genuine generalization and reasoning abilities: the goal is problems that challenge LLMs while remaining as easy as possible for humans.
The difficulty is roughly level 1 ("difficulty 1") on the AoPS competition rating scale linked below. For comparison, levels 3-6 cover the AIME (American Invitational Mathematics Examination) range, while levels 7-10 reach IMO (International Mathematical Olympiad) or Putnam difficulty.
https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings
All model statistics shown come from actual tests (complete responses will be added later). As a rule, an accuracy rate below 20% is scored as 0 points; when measurable accuracy exists, the score is assigned based on the probability of a correct answer and the quality of the responses.
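For illustration, here is a minimal sketch of how such a scoring rule might be applied, assuming repeated trials per question and a simple quality multiplier; the function name `score_model` and the exact weighting scheme are assumptions for illustration, not the benchmark's actual implementation.

```python
# Hypothetical sketch of the scoring rule described above.
# The 20% cutoff comes from the text; the quality multiplier is an assumption.

def score_model(correct: int, total: int, quality: float = 1.0) -> float:
    """Score a model on one question.

    correct -- number of correct answers across repeated trials
    total   -- number of trials
    quality -- assumed multiplier in [0, 1] reflecting response quality
    """
    accuracy = correct / total
    # Accuracy below 20% is treated as 0 points.
    if accuracy < 0.20:
        return 0.0
    # Otherwise, weight the probability of a correct answer by response quality.
    return accuracy * quality


if __name__ == "__main__":
    # Example: 7 correct answers out of 10 trials with high-quality reasoning.
    print(score_model(correct=7, total=10, quality=0.9))  # 0.63
```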