While reproducing the Erdős min-overlap result using Tinker, I noticed something odd: at step 34 my run logged a score of 0.38092, which appeared better than the paper's claimed 0.380932. That felt suspicious, so I dug deeper.
It turns out the logged score on W&B is not the verified C₅, but rather the model's self-claimed c5_bound. In evaluate_erdos_solution, verify_c5_solution is called and does compute the true value, but its return value is discarded — the function returns the claimed c5_bound instead:

    def evaluate_erdos_solution(h_values, c5_bound, n_points) -> float:
        verify_c5_solution(h_values, c5_bound, n_points)  # return value dropped
        return float(c5_bound)  # model's self-reported value

The validation only checks np.isclose(..., atol=1e-4), so a model can legally under-report by up to ~9e-5 and still pass. The actual verified C₅ from my run is 0.380972, while the claimed score is 0.380932 — a gap of 4.07e-5, which is within tolerance and would silently pass.
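To make the tolerance concrete, here is a minimal sketch using the rounded numbers quoted above. The exact arguments of the `np.isclose(...)` call are elided here, so this assumes the check compares the claimed value against the computed one with `atol=1e-4` and NumPy's default `rtol=1e-5`. The helper `within_tolerance` is a hypothetical name that mirrors NumPy's documented rule `|a - b| <= atol + rtol * |b|` in plain Python; it is not the repo's code.

```python
def within_tolerance(claimed: float, computed: float,
                     atol: float = 1e-4, rtol: float = 1e-5) -> bool:
    # Hypothetical helper: same acceptance rule as np.isclose(claimed, computed, ...),
    # i.e. |claimed - computed| <= atol + rtol * |computed|.
    return abs(claimed - computed) <= atol + rtol * abs(computed)


verified_c5 = 0.380972  # value recomputed from h_values in my run (rounded)
claimed_c5 = 0.380932   # value reported as c5_bound (rounded)

gap = verified_c5 - claimed_c5
accepted = within_tolerance(claimed_c5, verified_c5)

# A claim that under-reports by more than the tolerance is rejected (assumed example value).
rejected_claim = 0.3808
rejected = not within_tolerance(rejected_claim, verified_c5)

print(f"gap = {gap:.2e}, accepted = {accepted}, larger gap rejected = {rejected}")
```

Under these assumptions the under-reported claim is accepted, so the check alone cannot tell the two values apart.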
This means the paper's claimed score of 0.380932 is unverifiable as-is: if the true C₅ were 0.380972 (as in my reproduction), the system would accept 0.380932 as a valid claim without raising any error.
I suspect this is a bug — evaluate_erdos_solution should return the verified computed value rather than the claimed one:

    def evaluate_erdos_solution(h_values, c5_bound, n_points) -> float:
        computed_c5 = verify_c5_solution(h_values, c5_bound, n_points)
        return float(computed_c5)

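A quick way to check the proposed behavior is a small regression test. The sketch below is illustrative only: `verify_c5_solution` is replaced by a hypothetical stand-in that returns a fixed "verified" value and raises when the claim is out of tolerance, since the real implementation (and whether it raises or returns on mismatch) is not shown here.

```python
def verify_c5_solution(h_values, c5_bound, n_points) -> float:
    # HYPOTHETICAL stand-in for the real verifier: pretend the value
    # recomputed from h_values is 0.380972, and reject far-off claims.
    computed = 0.380972
    if abs(c5_bound - computed) > 1e-4:
        raise ValueError("claimed c5_bound does not match the computed value")
    return computed


def evaluate_erdos_solution(h_values, c5_bound, n_points) -> float:
    # Proposed fix: return the verified value, not the model's claim.
    computed_c5 = verify_c5_solution(h_values, c5_bound, n_points)
    return float(computed_c5)


# An under-reported claim that is inside the tolerance no longer changes the score.
score = evaluate_erdos_solution(h_values=[], c5_bound=0.380932, n_points=0)
assert score == 0.380972
```

With the fix, the logged score depends only on what the verifier computes, so a claim that passes the tolerance check cannot shift the reported number.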
Happy to submit a PR. Would also be helpful to know whether the published scores were logged from c5_bound or from the verified value.