Evaluating and Aligning CodeLLMs on Human Preference

1Alibaba Group; 2Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; 3University of Chinese Academy of Sciences; 4Shanghai Jiao Tong University;

Abstract

Current code LLMs focus on synthesizing correct code snippets while ignoring alignment with human preferences, where queries should be sampled from practical application scenarios and model-generated responses should satisfy human preference. To bridge the gap between model-generated responses and human preference, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, with 397 high-quality samples spanning 40 categories, carefully curated from user queries. Further, we propose SynCode-Instruct, a diverse synthetic instruction corpus (nearly 20B tokens) built by scaling instructions from the web. The results reveal performance differences between code execution-based benchmarks and CodeArena. Our systematic experiments on 20+ LLMs with CodeArena reveal a notable performance gap between open code LLMs (e.g., Qwen-Coder) and closed-source LLMs (e.g., the o1 and Claude series), underscoring the importance of alignment with human preference.

Introduction

The contributions of this paper are summarized as follows: (1) We propose CodeArena, a comprehensive code evaluation benchmark comprised of 397 manually annotated samples for evaluating the alignment between model-generated responses and human preference, covering 7 major categories and 40 subcategories. (2) We introduce SynCode-Instruct, a large-scale synthetic code instruction corpus scaled from website data. Based on SynCode-Instruct, an effective coder, Qwen2.5-SynCoder, serves as a strong baseline for CodeArena. (3) We systematically evaluate 39 LLMs on CodeArena and create a leaderboard to dynamically update the results. Notably, extensive experiments suggest that CodeArena can effectively measure the alignment between model-generated responses and human preference.

Code Generation Live Evaluation

CodeArena

Dataset Statistics. As shown in Figure 2 and Table 1, CodeArena consists of nearly 400 problems. All samples can be classified into 7 main classes and 40 subclasses. After word segmentation using the Qwen2.5-Coder tokenizer (Hui et al., 2024), the question length ranges from 5 to 6,736 tokens, with an average of 291 tokens.
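As a reference point, these length statistics can be reproduced with a short sketch like the one below; the local file name and the specific tokenizer checkpoint are illustrative assumptions rather than the exact artifacts used for CodeArena.

```python
# Sketch: question-length statistics with the Qwen2.5-Coder tokenizer.
# "codearena.json" and the tokenizer checkpoint below are illustrative assumptions.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

with open("codearena.json") as f:
    samples = json.load(f)  # assumed: a list of {"question": ...} objects

lengths = [len(tokenizer.encode(s["question"])) for s in samples]
print(f"min={min(lengths)} max={max(lengths)} avg={sum(lengths) / len(lengths):.0f}")
```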


Multiple Programming Languages. Figure 3 plots the distribution of programming languages in CodeArena, where we strive to cover the commonly used languages.


Difficulty levels of CodeArena. Figure 4 plots the difficulty levels of CodeArena, where all samples are classified into easy, medium, and hard. Medium and hard questions account for the majority of the samples, posing a tough challenge to LLMs.


Evaluation. Inspired by previous work (Chiang et al., 2024), we use GPT-4o-2024-08-06 as the judge to evaluate model performance. Specifically, we run two games, “compare A and B” and “compare B and A” (to avoid the relative positions of A and B affecting the results), to calculate the win rate of A against the baseline B.

Decontamination. To avoid data leakage, we decontaminate the prompts of CodeArena by removing exact matches (15-gram word overlap) against HumanEval (Chen et al., 2021a), MBPP (Austin et al., 2021), MultiPL-E (Cassano et al., 2023), McEval, NaturalCodeBench (Zhang et al., 2024), and DS-1000 (Lai et al., 2023).
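A minimal sketch of this 15-gram overlap check is given below; the word-level splitting, lower-casing, and placeholder prompt lists are assumptions, since the exact matching rules are not specified here.

```python
# Sketch: 15-gram word-overlap decontamination.
# Word-level splitting/lower-casing and the placeholder prompts are assumptions.
def ngrams(text: str, n: int = 15) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(prompt: str, benchmark_prompts: list[str], n: int = 15) -> bool:
    """True if the prompt shares any n-gram with a benchmark prompt."""
    prompt_grams = ngrams(prompt, n)
    return any(prompt_grams & ngrams(ref, n) for ref in benchmark_prompts)

benchmark_prompts = ["def has_close_elements(numbers, threshold): ..."]  # e.g. HumanEval
codearena_prompts = ["Write a Flask endpoint that uploads a file and stores it on disk."]
clean = [p for p in codearena_prompts if not is_contaminated(p, benchmark_prompts)]
```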

Comparison with other benchmarks. We compare CodeArena with other code benchmarks. CodeArena provides comprehensive coverage of 40 subtasks and 44 programming languages.

SynCode-Instruct

Recall from Common Crawl. A trained fastText classifier is used to distinguish code-related text from other raw web text; it recalls and cleans potential code data, while weak model-based classifiers and scorers filter out low-quality content. Our approach encompasses both file-level and repository-level pretraining data to ensure comprehensive coverage.
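The recall step can be pictured with the following sketch built on the fastText Python package; the training file, its `__label__` annotations, and the confidence threshold are illustrative assumptions rather than the actual pipeline configuration.

```python
# Sketch: recall code-related documents from raw web text with fastText.
# The training file (fastText "__label__code ..." / "__label__other ..." format)
# and the 0.9 confidence threshold are illustrative assumptions.
import fasttext

model = fasttext.train_supervised(input="code_vs_other.train.txt", epoch=5, wordNgrams=2)

def is_code_related(document: str, threshold: float = 0.9) -> bool:
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__code" and probs[0] >= threshold

raw_documents = ["<html>... def quicksort(arr): ...</html>", "Celebrity gossip of the week ..."]
recalled = [doc for doc in raw_documents if is_code_related(doc)]
```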

Code Classification for Code Snippets. We extract the first layer of CodeBERT (Feng et al., 2020) and fine-tune this tiny classifier on nearly 100 programming languages to build a language-identification model. We keep data in the main languages (e.g., C, Python, and Java) and downsample high-resource language data (e.g., HTML and Java) to keep the distribution balanced. We also remove samples that contain no code snippets.
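As an illustration of what a "first layer of CodeBERT plus a tiny classifier" could look like, the sketch below truncates the encoder to a single transformer layer and adds a linear head; the truncation scheme and the classification head are assumptions, not the exact architecture used.

```python
# Sketch: tiny language-identification model from CodeBERT's first encoder layer.
# Truncating to one layer and the linear head are assumptions about the setup.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_LANGUAGES = 100  # roughly 100 programming languages

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
backbone = AutoModel.from_pretrained("microsoft/codebert-base")
backbone.encoder.layer = backbone.encoder.layer[:1]  # keep only the first layer

class LanguageClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, num_labels: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.head(hidden[:, 0])  # classify from the <s> position

model = LanguageClassifier(backbone, NUM_LANGUAGES)
batch = tokenizer(["def add(a, b):\n    return a + b"], return_tensors="pt", truncation=True)
logits = model(**batch)  # fine-tune with cross-entropy on labeled snippets
```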

Scaling Code Instruction. Initially, we adopt rule-based filtering to clean pre-extracted content from the recalled documents by removing site information, advertisements, and HTML tags, thereby significantly reducing document length for further processing. Different from previous work (Yue et al., 2024), we use Qwen2.5-72B to create new questions instead of extracting question-and-answer pairs. The synthetic instruction corpus generated by Qwen2.5 is used for the first training stage, and the high-quality data from GPT-4o is used for the second stage.
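The question-synthesis step could be organized roughly as in the sketch below, assuming Qwen2.5-72B-Instruct is served behind an OpenAI-compatible endpoint; the endpoint URL, model name, and prompt wording are illustrative assumptions rather than the authors' exact pipeline.

```python
# Sketch: turn a cleaned web document into a new coding question and answer.
# Assumes an OpenAI-compatible server hosting Qwen2.5-72B-Instruct; the endpoint,
# model name, and prompt text are illustrative, not the authors' exact setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def synthesize_instruction(document: str) -> dict:
    question = client.chat.completions.create(
        model="Qwen2.5-72B-Instruct",
        messages=[{
            "role": "user",
            "content": "Based on the following web document, write one new, self-contained "
                       "coding question (do not copy an existing Q&A pair):\n\n" + document,
        }],
    ).choices[0].message.content
    answer = client.chat.completions.create(
        model="Qwen2.5-72B-Instruct",
        messages=[{"role": "user", "content": question}],
    ).choices[0].message.content
    return {"instruction": question, "response": answer}
```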


Holistic Evaluation

Experimental Setup.

Instruction Dataset

CodeLLMs. We evaluate 23 models with sizes ranging from 7B to 72B parameters, including general/code LLMs, open/closed-source models, and base/instruction models.

Evaluation Benchmark

EvalPlus. EvalPlus (Liu et al., 2023) is an upgraded version of HumanEval (Chen et al., 2021a) and MBPP (Austin et al., 2021) for testing code generation capabilities. We report scores on HumanEval (HE)/MBPP with the base test cases and on HumanEval+ (HE+)/MBPP+ with the additional test cases.

MultiPL-E. The MultiPL-E test set (Cassano et al., 2023) contains HumanEval (Python) and its translations into other programming languages, i.e., Java, C++, JavaScript, and TypeScript.

CodeArena. Different from EvalPlus and MultiPL-E, CodeArena consists of many non-algorithmic questions, which are not suitable for code execution-based evaluation. Each question is scored twice by GPT-4o, using the two input orders “A, B” and “B, A”, to calculate the win rate and tie rate, where “A” is the baseline response from gpt-4-turbo-2024-04-09 and “B” is the model-generated response.

Evaluation Metrics

Pass@k. Given the model-generated response, we extract the expected function and feed the test cases into the extracted function to verify the correctness of the generation. We adopt greedy Pass@1 (Chen et al., 2021a) to report the results on EvalPlus and MultiPL-E.
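A simplified sketch of this check is shown below; in practice execution should be sandboxed (e.g., in a subprocess with a timeout), and the code-extraction regex and direct `exec` calls are illustrative assumptions.

```python
# Sketch: greedy Pass@1 check for a single problem (illustrative only; real
# evaluation should sandbox execution, e.g. in a subprocess with a timeout).
import re

def extract_code(response: str) -> str:
    """Pull the first fenced code block out of a model response, if any."""
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else response

def passes_tests(response: str, entry_point: str, test_cases: list[str]) -> bool:
    namespace: dict = {}
    try:
        exec(extract_code(response), namespace)            # define the function
        func = namespace[entry_point]
        for case in test_cases:                            # e.g. "assert add(1, 2) == 3"
            exec(case, {**namespace, entry_point: func})
        return True
    except Exception:
        return False

# Pass@1 over a benchmark is then the fraction of problems whose single greedy
# sample passes all of its test cases.
```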

LLM as a Judge. Due to the high cost of collecting human preferences (Zheng et al., 2023a), we use pairwise comparison for judgment: an LLM judge is given a question and two answers and determines which one is better or declares a tie. We report the win rate/tie rate for CodeArena.
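Combining this pairwise judgment with the order swap described in the Evaluation paragraph could look like the sketch below; the `judge` callable, its verdict labels, and the rule for merging the two orders into win/tie/loss are assumptions rather than the exact GPT-4o template and aggregation used here.

```python
# Sketch: pairwise LLM-as-a-judge with order swapping for win/tie rates.
# `judge(question, first, second)` is assumed to query GPT-4o and return
# "first", "second", or "tie"; the aggregation rule below is also an assumption.
from collections import Counter

def judge_pair(judge, question: str, baseline: str, candidate: str) -> str:
    """Run both orders ("A,B" and "B,A") and merge the two verdicts."""
    order_ab = judge(question, baseline, candidate)   # baseline shown first
    order_ba = judge(question, candidate, baseline)   # candidate shown first
    candidate_wins = (order_ab == "second") + (order_ba == "first")
    if candidate_wins == 2:
        return "win"
    if candidate_wins == 0 and "tie" not in (order_ab, order_ba):
        return "loss"
    return "tie"

def win_tie_rate(results: list[str]) -> tuple[float, float]:
    counts = Counter(results)
    return counts["win"] / len(results), counts["tie"] / len(results)
```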

Implementation Details.

We fine-tune Qwen2.5-Coder-32B on nearly 20B synthetic tokens generated from website data, where GPT-4o generates 1B tokens and Qwen2.5-Coder-Instruct generates the remaining tokens. Qwen2.5-SynCoder is fine-tuned on the synthetic instruction corpus SynCode-Instruct with 256 NVIDIA A100-80GB GPUs. The learning rate first warms up to 8 × 10⁻⁵ over 100 steps and then follows a cosine decay schedule. We adopt the Adam optimizer (Kingma and Ba, 2015) with a global batch size of 2048 samples and a tensor parallel size of 8, truncating sentences to 32K tokens.
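For concreteness, the warmup-then-cosine schedule can be written as follows; only the peak learning rate of 8 × 10⁻⁵ and the 100 warmup steps come from the text, while the total step count and the final learning rate of zero are assumptions.

```python
# Sketch: linear warmup to the peak LR over 100 steps, then cosine decay.
# The total number of steps and the final LR of zero are assumptions.
import math

PEAK_LR = 8e-5
WARMUP_STEPS = 100

def learning_rate(step: int, total_steps: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))

for s in (0, 99, 100, 5000, 9999):  # a hypothetical 10,000-step run
    print(s, f"{learning_rate(s, 10_000):.2e}")
```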

Results and Discussion.

Main Results

CodeArena. Table 3, the leaderboard, reports the win rate and tie rate of different instruction-tuned LLMs on CodeArena.

EvalPlus and MultiPL-E. Table 4 shows that Qwen2.5-SynCoder, trained with large-scale synthetic instructions, significantly beats previous strong open-source baselines and closes the gap with GPT-4o and Claude, which verifies that large-scale synthetic data can bring significant improvements to the base model.


Discussion

Examples of CodeArena. Figure 7 lists six examples from different subtasks, covering Python, HTML, CSS, and Java. Different from previous benchmarks (Cassano et al., 2023; Jain et al., 2024), which are comprised of algorithmic questions in a fixed format, the queries in CodeArena are more consistent with the distribution of user questions in real Q&A scenarios.


Difference between CodeArena and execution-based benchmarks. Compared to benchmarks such as MultiPL-E, which are evaluated by code execution, CodeArena is created from real-world Q&A and evaluated with LLM-as-a-judge to measure the alignment between model-generated responses and human preference.


Scaling Synthetic Instruction Corpora. We further analyze the performance of Qwen2.5-SynCoder on MultiPL-E and CodeArena given different sizes of the instruction corpus. We take the full SynCode-Instruct set (the 19B tokens of synthetic data are placed at the front and the 1B tokens of high-quality data at the end) and extract the first K billion tokens as fine-tuning data, with K = {2, 4, ..., 20}; for each K, the corresponding instruction pairs are drawn from the full set. Figure 9 shows the performance on CodeArena. As the amount of instruction data increases, Qwen2.5-SynCoder still obtains significant improvements, which emphasizes the importance of scaling the instruction corpus. Besides, the two-stage SFT achieves better performance than one-stage training (red line), where the high-quality data brings a large improvement at the end.


Distribution of different benchmarks. We visualize the queries of CodeArena and MultiPL-E (Python, Java, and C++) by extracting the last-layer encoder representations and projecting them with t-SNE (Van der Maaten and Hinton, 2008). The visualization shows that the distribution of queries in CodeArena is highly diverse, making it suitable for evaluating human preference in realistic scenarios.
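A sketch of this visualization is given below; the encoder checkpoint, mean pooling, and placeholder queries are assumptions, as the text only states that last-layer encoder representations are projected with t-SNE.

```python
# Sketch: t-SNE over query embeddings from CodeArena and MultiPL-E.
# The encoder checkpoint, mean pooling, and placeholder queries are assumptions.
import torch
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(queries: list[str]) -> torch.Tensor:
    batch = tokenizer(queries, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # last-layer representations
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)            # mean-pool over tokens

codearena_queries = [  # placeholders for the real CodeArena queries
    "How do I center a div with CSS grid?",
    "Write a Flask endpoint that uploads a file.",
    "Why does my Java stream collector throw a NullPointerException?",
]
multipl_e_queries = [  # placeholders for the real MultiPL-E prompts
    "def has_close_elements(numbers, threshold):",
    "def separate_paren_groups(paren_string):",
    "def truncate_number(number):",
]
points = TSNE(n_components=2, perplexity=3).fit_transform(
    embed(codearena_queries + multipl_e_queries).numpy()
)
n = len(codearena_queries)
plt.scatter(points[:n, 0], points[:n, 1], label="CodeArena")
plt.scatter(points[n:, 0], points[n:, 1], label="MultiPL-E")
plt.legend()
plt.savefig("query_tsne.png")
```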


Conclusion

In this work, we introduce CodeArena, a meticulously human-curated benchmark composed of 397 high-quality samples spanning 40 categories, derived from real-world user queries, to address discrepancies between model-generated responses and human preferences in coding tasks. Additionally, we create SynCode-Instruct, a diverse synthetic instruction corpus containing nearly 20 billion tokens, by scaling web-sourced instructions. Our evaluation of over 20 large language models (LLMs) using CodeArena highlights significant performance discrepancies between code-execution-based benchmarks and our human-curated benchmark. Notably, there is a marked performance gap between open-source code LLMs (such as DeepSeek-Coder) and closed-source LLMs (such as the o1 and Claude series), underscoring the importance of aligning AI models with human preferences in coding tasks.

Citation
@inproceedings{
title={Evaluating and Aligning CodeLLMs on Human Preference},
author={Jian Yang and Jiaxi Yang and Ke Jin and Yibo Miao and Lei Zhang and Liqun Yang and Zeyu Cui and Yichang Zhang and Binyuan Hui and Junyang Lin},
booktitle={},
year={2025},
url={}
}