A series of simple questions that nonetheless pose serious challenges to LLMs, unlike common mathematical benchmarks, which are many times more difficult for humans. These questions are meant to test models' genuine generalization and reasoning abilities: the goal is problems that challenge LLMs while remaining as easy as possible for humans.
The difficulty is roughly level 1 ("difficulty 1") on the AoPS competition rating scale linked below. For comparison, levels 3-6 cover the AIME (American Invitational Mathematics Examination) range, while levels 7-10 reach IMO (International Mathematical Olympiad) or Putnam difficulty.
https://artofproblemsolving.com/wiki/index.php/AoPS_Wiki:Competition_ratings
All model statistics shown come from actual tests (complete responses will be added later). As a rule, an accuracy rate below 20% is scored as 0 points; when measurable accuracy exists, the score is assigned based on the probability of a correct answer and the quality of the responses.
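For illustration, here is a minimal sketch of how such a scoring rule might be applied, assuming repeated trials per question and a simple quality multiplier; the function name `score_model` and the exact weighting scheme are assumptions for illustration, not the benchmark's actual implementation.

```python
# Hypothetical sketch of the scoring rule described above.
# The 20% cutoff comes from the text; the quality multiplier is an assumption.

def score_model(correct: int, total: int, quality: float = 1.0) -> float:
    """Score a model on one question.

    correct -- number of correct answers across repeated trials
    total   -- number of trials
    quality -- assumed multiplier in [0, 1] reflecting response quality
    """
    accuracy = correct / total
    # Accuracy below 20% is treated as 0 points.
    if accuracy < 0.20:
        return 0.0
    # Otherwise, weight the probability of a correct answer by response quality.
    return accuracy * quality


if __name__ == "__main__":
    # Example: 7 correct answers out of 10 trials with high-quality reasoning.
    print(score_model(correct=7, total=10, quality=0.9))  # 0.63
```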