Measuring The Impact Of Programming Language Distribution

02/03/2023
by Gabriel Orlanski, et al.

Current benchmarks for evaluating neural code models focus on only a small subset of programming languages, excluding many popular languages such as Go or Rust. To ameliorate this issue, we present the BabelCode framework for execution-based evaluation of any benchmark in any language. BabelCode enables new investigations into the qualitative performance of models' memory, runtime, and individual test case results. Additionally, we present a new code translation dataset called Translating Python Programming Puzzles (TP3), derived from the Python Programming Puzzles (Schuster et al., 2021) benchmark, that involves translating expert-level Python functions to any language. With both BabelCode and the TP3 benchmark, we investigate whether balancing the distributions of 14 languages in a training dataset improves a large language model's performance on low-resource languages. Training a model on a balanced corpus results in, on average, 12.34 baseline. We find that this strategy achieves 66.48 low-resource languages at the cost of only a 12.94 languages. In our three translation tasks, this strategy yields, on average, 30.77 pass@k.
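The abstract reports its results in terms of pass@k but does not say how the metric is computed. As a minimal sketch, assuming the widely used unbiased estimator introduced with the HumanEval benchmark (Chen et al., 2021), pass@k for one problem can be estimated from n sampled programs of which c pass all test cases. The function name and argument checks below are illustrative and are not taken from BabelCode.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (illustrative helper, not BabelCode's API).

    n: number of programs sampled for the problem
    c: number of those programs that pass every test case
    k: sample budget being scored, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples means every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this assumption, a benchmark-level score is the mean of the per-problem estimates, which is what `mean_pass_at_k` computes.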
