Introduction to MRG-Bench

MRG-Bench is a repository-level code generation dataset designed to evaluate the performance of different models on practical code generation tasks.

Leaderboard

Direct Generation Results

The function annotation and function signature are provided as input, and the model is asked to generate the corresponding function body.
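
Below is a minimal sketch of how a direct-generation prompt could be assembled from a sample. The field names ("annotation", "signature") and the prompt wording are illustrative assumptions, not the benchmark's exact format.

```python
def build_direct_prompt(sample: dict) -> str:
    """Combine the function annotation and signature into a generation prompt."""
    return (
        "Complete the following function.\n\n"
        f"{sample['annotation']}\n"
        f"{sample['signature']}\n"
    )

# Hypothetical sample for illustration only.
example = {
    "annotation": '"""Return the n-th Fibonacci number."""',
    "signature": "def fib(n: int) -> int:",
}
print(build_direct_prompt(example))
```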

Context Methods Results

Different kinds of context are provided, including in_file context, the bodies or signatures of useful functions, and the whole-folder context.
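
A minimal sketch of how these three context settings could be turned into a prompt prefix is shown below. The field names ("in_file", "useful_functions", "folder_files") are hypothetical placeholders for the corresponding context sources.

```python
def build_context(sample: dict, setting: str) -> str:
    if setting == "in_file":
        # Remaining code of the file that contains the target function.
        return sample["in_file"]
    if setting == "useful_functions":
        # Bodies or signatures of functions the target implementation relies on.
        return "\n\n".join(sample["useful_functions"])
    if setting == "folder":
        # Concatenate every file in the enclosing folder.
        return "\n\n".join(
            f"# File: {path}\n{code}"
            for path, code in sample["folder_files"].items()
        )
    raise ValueError(f"unknown setting: {setting}")

# Hypothetical sample for illustration only.
sample = {
    "in_file": "import math\n\nCACHE = {}",
    "useful_functions": ["def helper(x):\n    return x * 2"],
    "folder_files": {"utils.py": "def helper(x):\n    return x * 2"},
}
print(build_context(sample, "folder"))
```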

RAG Methods Results

Results of different RAG-based (retrieval-augmented generation) methods.
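
A minimal sketch of one possible RAG pipeline follows: retrieve the repository snippets most similar to the query (annotation plus signature) and prepend them to the prompt. Token-overlap (Jaccard) scoring stands in here for whatever retriever a concrete method actually uses (BM25, embeddings, etc.); the snippet data is invented for illustration.

```python
import re

def tokenize(text: str) -> set:
    """Split text into a set of identifier-like tokens."""
    return set(re.findall(r"[A-Za-z_]\w*", text.lower()))

def retrieve(query: str, snippets: list, k: int = 2) -> list:
    """Return the k snippets with the highest token overlap with the query."""
    q = tokenize(query)
    scored = []
    for s in snippets:
        t = tokenize(s)
        score = len(q & t) / (len(q | t) or 1)
        scored.append((score, s))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:k]]

query = '"""Load the config file and return a dict.""" def load_config(path):'
repo_snippets = [
    "def parse_yaml(path):\n    ...",
    "def fib(n):\n    ...",
    "CONFIG_DIR = 'configs'",
]
context = "\n\n".join(retrieve(query, repo_snippets))
print(context)
```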

PDLR Scores

PDLR (Potential Data Leakage Rate) measures whether the model remembers information from the open-source repository. Specifically, if the model uses an API from the repository that was not provided in its context when generating, we consider this a case of potential data leakage. A high PDLR indicates that the model's evaluation results are not reliable.
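
The following is a minimal sketch of the PDLR idea under simplifying assumptions: collect the names called in the generated code and flag any call that belongs to the repository's API but was never shown in the input context. The sample fields ("repo_apis", "context", "generated"), the regex over call sites, and the substring check for "provided" are all illustrative simplifications, not the benchmark's actual analysis.

```python
import re

CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")

def called_names(code: str) -> set:
    """Crude approximation of the functions a piece of code calls."""
    return set(CALL.findall(code))

def is_potential_leak(sample: dict) -> bool:
    used_repo_apis = called_names(sample["generated"]) & set(sample["repo_apis"])
    # An API counts as provided if its name appears anywhere in the input context.
    return any(name not in sample["context"] for name in used_repo_apis)

def pdlr(samples: list) -> float:
    """Fraction of samples showing potential data leakage."""
    return sum(is_potential_leak(s) for s in samples) / len(samples)

# Hypothetical samples for illustration only.
samples = [
    {"repo_apis": ["load_config"], "context": "",
     "generated": "cfg = load_config('a.yml')"},
    {"repo_apis": ["load_config"], "context": "def load_config(path): ...",
     "generated": "cfg = load_config('a.yml')"},
]
print(f"PDLR = {pdlr(samples):.2f}")  # 0.50: the first sample calls an unseen repo API
```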

RejR Scores

The RejR (Rejection Rate) metric evaluates the model's ability to make effective use of the provided contextual information. For each sample, we compare the model's input context with its generated result. If a function provided in the input context is not used in the generated result, the sample is marked as rejected. A higher rejection rate indicates that the model struggles to utilize the contextual information effectively, reflecting poorer performance in code generation.
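
Below is a minimal sketch of the RejR computation, following the description above and reusing the same crude call-site regex as in the PDLR sketch. The "context_functions" field, the samples, and the rejection test are hypothetical simplifications for illustration.

```python
import re

CALL = re.compile(r"\b([A-Za-z_]\w*)\s*\(")

def called_names(code: str) -> set:
    """Crude approximation of the functions a piece of code calls."""
    return set(CALL.findall(code))

def is_rejected(sample: dict) -> bool:
    used = called_names(sample["generated"])
    provided = set(sample["context_functions"])
    # Rejected if some function provided in the context is never called in the output.
    return bool(provided - used)

def rejr(samples: list) -> float:
    """Fraction of samples marked as rejected."""
    return sum(is_rejected(s) for s in samples) / len(samples)

# Hypothetical samples for illustration only.
samples = [
    {"context_functions": ["helper"],
     "generated": "def f(x):\n    return helper(x)"},
    {"context_functions": ["helper"],
     "generated": "def f(x):\n    return x * 2"},
]
print(f"RejR = {rejr(samples):.2f}")  # 0.50: the second sample ignores helper()
```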