MRG-Bench is a repository-level code generation dataset designed to evaluate the performance of different models on practical code generation tasks.
The model is given a function's annotation and signature as input and is asked to generate the corresponding function body.
Different context settings are evaluated, including in-file context, the bodies or signatures of useful functions, and the whole-folder context.
Different RAG-based methods are also evaluated; a sketch of prompt construction under these settings is shown below.
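As an illustration only, the following sketch shows how a generation prompt might be assembled under these settings. The sample fields (`docstring`, `signature`, `in_file_context`, `repo_snippets`), the setting names, and the toy lexical retriever are assumptions made for the example, not the benchmark's actual interface.

```python
# Illustrative sketch of prompt assembly for one MRG-Bench-style sample.
# The data schema and the retrieval step are assumptions, not the benchmark's real API.
from dataclasses import dataclass, field


@dataclass
class Sample:
    docstring: str                    # natural-language annotation of the target function
    signature: str                    # e.g. "def parse_config(path: str) -> dict:"
    in_file_context: str = ""         # code surrounding the target function in its file
    repo_snippets: list = field(default_factory=list)  # candidate snippets from the repository


def retrieve(query: str, snippets: list, k: int = 3) -> list:
    """Toy lexical retriever: rank repository snippets by word overlap with the query."""
    query_tokens = set(query.lower().split())
    return sorted(snippets,
                  key=lambda s: len(query_tokens & set(s.lower().split())),
                  reverse=True)[:k]


def build_prompt(sample: Sample, setting: str) -> str:
    """Assemble the model input under one of the context settings."""
    parts = []
    if setting == "in_file":
        parts.append(sample.in_file_context)
    elif setting == "rag":
        # RAG-style setting: prepend retrieved repository snippets.
        parts.extend(retrieve(sample.docstring + " " + sample.signature,
                              sample.repo_snippets))
    # The annotation and signature are always given; the model completes the body.
    parts.append(f'"""{sample.docstring}"""')
    parts.append(sample.signature)
    return "\n\n".join(p for p in parts if p)
```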
PDLR (Potential Data Leakage Rate) measures whether the model has memorized information from the open-source repository. Specifically, if the generated code calls an API from the repository that was not provided in the model's context, we count the sample as a case of potential data leakage. A high PDLR indicates that the model's evaluation results are not reliable.
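A minimal sketch of how PDLR could be computed, assuming each sample has already been analyzed into two sets of repository API names (one from the generation, one from the provided context); this per-sample schema is an assumption for illustration.

```python
def potential_data_leakage_rate(samples) -> float:
    """Fraction of samples whose generation calls a repository API
    that was not provided in the context.

    Each sample is assumed to be a dict with two sets of API names:
      'apis_in_generation' - repository APIs called in the generated code
      'apis_in_context'    - repository APIs visible in the provided context
    """
    leaked = sum(1 for s in samples
                 if s["apis_in_generation"] - s["apis_in_context"])
    return leaked / len(samples) if samples else 0.0
```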
RejR (Rejection Rate) measures the model's ability to make effective use of the provided contextual information. For each sample, we compare the model's input context with its generated result: if a function supplied in the context is not used in the generated code, the sample is marked as rejected. A higher rejection rate indicates that the model struggles to utilize the contextual information effectively, reflecting poorer code generation performance.
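Following the definition above, a sample counts as rejected when some function provided in its context goes unused in the generation. A minimal computation sketch, again assuming a hypothetical per-sample schema of name sets:

```python
def rejection_rate(samples) -> float:
    """Fraction of samples that leave at least one provided context function unused.

    Each sample is assumed to be a dict with two sets of function names:
      'context_functions' - functions supplied in the input context
      'used_functions'    - functions actually called in the generated code
    """
    rejected = sum(1 for s in samples
                   if s["context_functions"] - s["used_functions"])
    return rejected / len(samples) if samples else 0.0
```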