The Isabelle Community Benchmark

09/28/2022
by   Fabian Huch, et al.
0

Choosing hardware for theorem proving is no simple task: automated provers are highly complex and optimized programs, often utilizing a parallel computation model, and there is little prior research on the hardware impact on prover performance. To alleviate the problem for Isabelle, we initiated a community benchmark where the build time of HOL-Analysis is measured. On 54 distinct CPUs, a total of 669 runs with different Isabelle configurations were reported by Isabelle users. Results range from 107s to over 11h. We found that current consumer CPUs performed best, with an optimal number of 8 to 16 threads, largely independent of heap memory. As for hardware parameters, CPU base clock affected multi-threaded execution most with a linear correlation of 0.37, whereas boost frequency was the most influential parameter for single-threaded runs (correlation coefficient 0.55); cache size played no significant role. When comparing our benchmark scores with popular high-performance computing benchmarks, we found a strong linear relationship with Dolfyn (R^2 = 0.79) in the single-threaded scenario. Using data from the 3DMark CPU Profile consumer benchmark, we created a linear model for optimal (multi-threaded) Isabelle performance. When validating, the model has an average R^2-score of 0.87; the mean absolute error in the final model corresponds to a wall-clock time of 46.6s. With a dataset of true median values for the 3DMark, the error improves to 37.1s.

READ FULL TEXT

Please sign up or login with your details

Forgot password? Click here to reset