Paper at a Glance
The need for this research arises from the limitations of existing debugging datasets, which primarily focus on function-level code repair and therefore do not fully capture the complexities involved in repository-level debugging. Large Language Models (LLMs) have shown promise in automatic program repair, potentially reducing debugging time and increasing developer efficiency. However, current datasets do not adequately assess LLMs' capabilities in handling diverse tasks, languages, and error types at the repository level, leading to an incomplete understanding of their performance in real-world scenarios.
To address these gaps, the paper proposes RepoDebug, a comprehensive dataset designed for repository-level code debugging. RepoDebug encompasses 22 subtypes of errors across 8 popular programming languages and supports 3 distinct debugging tasks, offering a more realistic and challenging environment for evaluating LLMs. The study evaluates 10 LLMs using this dataset, revealing that even the best-performing model, Claude 3.5 Sonnet, struggles with repository-level debugging. This highlights the need for further advancements in LLMs to effectively tackle complex debugging tasks, suggesting that while LLMs have made strides in code repair, significant challenges remain in achieving proficiency at the repository level.
📖 Core Content of the Paper
1. What problem does it address?
The core problem addressed by this paper is the evaluation of Large Language Models (LLMs) in repository-level code debugging, which is more complex and realistic than function-level debugging. Existing datasets primarily focus on function-level code repair, leading to an incomplete understanding of LLMs' capabilities in handling repository-level scenarios. This gap is significant because repository-level debugging involves diverse tasks, languages, and error types, which are more reflective of real-world software development challenges. The motivation for this research stems from the need to enhance developer efficiency and reduce debugging time, which are critical in the software engineering domain. The paper identifies the lack of comprehensive datasets that encompass multiple tasks, languages, and error types as a major research gap.
2. What solution does it propose?
The paper proposes RepoDebug, a novel multi-task and multi-language repository-level code debugging dataset. This dataset is designed to address the limitations of existing datasets by supporting 22 subtypes of errors across 8 programming languages and 3 debugging tasks: Bug Identification, Bug Localization, and Automatic Program Repair. The key innovation of RepoDebug lies in its comprehensive coverage of error types and its support for multiple languages, which distinguishes it from previous datasets that are often limited in scope. RepoDebug aims to provide a more realistic and challenging benchmark for evaluating the debugging capabilities of LLMs at the repository level.
3. Core methods, steps, and strategies
The methodology involves constructing the RepoDebug dataset by collecting data from 63 GitHub repositories created after 2022 to avoid data leakage. The dataset includes 22 distinct subtypes of bugs, classified into four categories: syntax errors, logic errors, reference errors, and multiple errors. The construction process involves using abstract syntax trees with the tree-sitter tool to introduce bugs into code files and record their exact locations. The dataset is divided into a training set with 46 repositories and a test set with 17 repositories. The paper also employs automated filtering and manual inspection to ensure the validity of the bugs. The evaluation of LLMs is conducted using four metrics that distinguish between the success rate of identifying single and multiple error locations.
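To make the construction step concrete, here is a minimal sketch (not the authors' pipeline) of how a tree-sitter abstract syntax tree can be used to inject a single logic error, a flipped comparison operator, into a source file and record its exact location. It assumes the `tree_sitter` and `tree_sitter_python` Python packages; the binding API varies slightly across versions.

```python
# Hypothetical illustration, not RepoDebug's actual code: inject one logic
# error by flipping a comparison operator found via a tree-sitter AST and
# record where it was injected. Assumes the `tree_sitter` and
# `tree_sitter_python` packages; the binding API differs slightly by version.
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)   # older bindings: Parser(); parser.language = PY_LANGUAGE

source = b"def is_small(x):\n    return x < 10\n"
tree = parser.parse(source)

FLIP = {b"<": b">", b">": b"<", b"==": b"!=", b"!=": b"=="}

def first_comparison(node):
    """Depth-first search for the first comparison-operator token."""
    if node.type in ("<", ">", "==", "!="):
        return node
    for child in node.children:
        found = first_comparison(child)
        if found is not None:
            return found
    return None

op = first_comparison(tree.root_node)
if op is not None:
    buggy = (source[:op.start_byte]
             + FLIP[source[op.start_byte:op.end_byte]]
             + source[op.end_byte:])
    # A RepoDebug-style instance would store the buggy file, the bug subtype,
    # and the exact (row, column) location of the edit.
    print("bug subtype: flipped comparison, location:", op.start_point)
    print(buggy.decode())
```

The same AST-driven idea generalizes to other subtypes and languages, since tree-sitter provides parsers for most popular languages.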
4. Experimental design
The experiments are designed to evaluate the performance of 10 LLMs on the RepoDebug dataset, including 3 open-source and 7 closed-source models. The evaluation metrics focus on the success rate of identifying and repairing errors, with a particular emphasis on the ability to handle multiple error locations. The results reveal that even the best-performing model, Claude 3.5 Sonnet, struggles with repository-level debugging, particularly as the number of errors and code length increase. The experiments also highlight variations in performance across different programming languages, with Java errors being easier to detect and repair compared to others. The difficulty of error types is also assessed, with multiple errors posing the greatest challenge.
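The four metrics themselves are not reproduced in this summary, so the snippet below is only a hypothetical illustration of the kind of location-aware scoring they imply: an exact-match success rate computed separately for single-error and multiple-error instances (the `Instance` class and its field names are invented for the example).

```python
# Hypothetical scoring sketch (the paper's exact metric definitions are not
# reproduced here): success rate of exact location prediction, reported
# separately for single-error and multiple-error instances.
from dataclasses import dataclass

@dataclass
class Instance:
    gold_locations: set       # ground-truth buggy line numbers
    predicted_locations: set  # line numbers predicted by the model

def success_rates(instances):
    single_hits = single_total = multi_hits = multi_total = 0
    for inst in instances:
        exact = inst.predicted_locations == inst.gold_locations
        if len(inst.gold_locations) <= 1:
            single_total += 1
            single_hits += exact
        else:
            multi_total += 1
            multi_hits += exact
    return (single_hits / max(single_total, 1),
            multi_hits / max(multi_total, 1))

# Example: one single-error instance localized exactly, one multiple-error
# instance where only one of two locations is found.
data = [Instance({12}, {12}), Instance({3, 40}, {3})]
print(success_rates(data))  # -> (1.0, 0.0)
```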
5. Conclusions
The main findings of the paper indicate that current LLMs have significant limitations in repository-level debugging, as evidenced by their performance on the RepoDebug dataset. The study concludes that while LLMs show promise in function-level debugging, their capabilities at the repository level are still inadequate, especially when dealing with complex error scenarios. The paper acknowledges the limitations of the current models and suggests that future research should focus on improving LLMs' ability to handle diverse and complex debugging tasks. Additionally, the paper highlights the need for further development of datasets like RepoDebug to better evaluate and enhance the debugging capabilities of LLMs in real-world scenarios.
🤔 Questions of Interest
- How does RepoDebug facilitate the evaluation of LLMs in generating patches for different bug types, such as syntax, logic, and reference errors? The user's interest in automatic program repair includes generating patches for various bug types. This question explores how the RepoDebug dataset, with its diverse error subtypes, supports the evaluation of LLMs in this specific aspect of program repair.
- In what ways does RepoDebug assess the ability of LLMs to localize bugs within repository-level code, and how does this compare to function-level debugging? Bug localization is a critical step in program repair. This question aims to understand how the dataset challenges LLMs' capabilities in localizing bugs in more complex, repository-level scenarios compared to simpler function-level tasks.
- What methodologies are employed in RepoDebug to evaluate the correctness of patches generated by LLMs, and how do these methodologies address different error types? Evaluating patch correctness is crucial for effective program repair. This question seeks to uncover the specific methodologies used in RepoDebug to assess patch correctness across various error types, aligning with the user's interest in patch validation.
- How does RepoDebug incorporate static and dynamic analysis techniques to enhance the reliability of LLM-generated repairs? The user is interested in the interaction between LLMs and analysis techniques to improve repair reliability. This question investigates whether and how RepoDebug integrates such techniques to support more reliable debugging outcomes.
- What are the limitations identified in current LLMs' ability to handle multiple error locations within a repository, as revealed by the RepoDebug experiments? Understanding the limitations of LLMs in handling complex error scenarios is essential for advancing automatic program repair. This question probes into the specific challenges faced by LLMs when dealing with multiple error locations, as highlighted by the paper's experimental findings.
💡 Item-by-Item Answers
How does RepoDebug facilitate the evaluation of LLMs in generating patches for different bug types, such as syntax, logic, and reference errors?
RepoDebug significantly enhances the evaluation of Large Language Models (LLMs) in generating patches for various bug types, including syntax, logic, and reference errors, by providing a comprehensive and diverse dataset. The paper highlights that RepoDebug is meticulously constructed with 22 distinct subtypes of bugs, systematically classified into four types: syntax errors, logic errors, reference errors, and multiple errors. This classification allows for a nuanced assessment of LLMs' capabilities across different error types, which is crucial for understanding their proficiency in automatic program repair. The dataset spans eight programming languages and supports three debugging tasks, namely Bug Identification, Bug Localization, and Automatic Program Repair, thereby offering a multi-task and multi-language repository-level debugging environment. This diversity in tasks and languages is essential as it reflects real-world scenarios more accurately than previous datasets, which often focused on function-level code repair capabilities.
Furthermore, the paper underscores the importance of evaluating LLMs in repository-level scenarios, which are more complex and realistic compared to function-level debugging. The inclusion of multiple error types in RepoDebug addresses the limitations of existing datasets that often suffer from limited diversity in tasks, languages, and error types. By conducting evaluation experiments on 10 LLMs, the paper reveals that even the best-performing model, Claude 3.5 Sonnet, struggles with repository-level debugging, particularly as the number of errors increases and the code length grows. This finding highlights the challenges LLMs face in handling complex debugging tasks and underscores the need for datasets like RepoDebug that can facilitate a more comprehensive evaluation. The paper also notes that errors of different types vary in difficulty, with multiple errors being the most challenging and syntactic errors being the simplest, providing further insight into the capabilities and limitations of LLMs in automatic program repair. Overall, RepoDebug's diverse error subtypes and comprehensive evaluation framework play a crucial role in advancing the understanding of LLMs' performance in generating patches for different bug types.
Confidence: 0.90
In what ways does RepoDebug assess the ability of LLMs to localize bugs within repository-level code, and how does this compare to function-level debugging?
RepoDebug offers a unique approach to assessing the ability of large language models (LLMs) to localize bugs within repository-level code, distinguishing itself from traditional function-level debugging tasks. The paper highlights that existing datasets primarily focus on function-level code repair, which limits the understanding of LLMs' capabilities in more complex scenarios. RepoDebug addresses this gap by introducing a multi-task and multi-language repository-level debugging dataset, which includes 22 subtypes of errors across 8 programming languages and supports 3 debugging tasks: Bug Identification, Bug Localization, and Automatic Program Repair. This comprehensive approach allows for a more realistic evaluation of LLMs in handling repository-level code, which is inherently more complex due to the larger codebase and the interdependencies between different parts of the code.
The significance of RepoDebug lies in its ability to challenge LLMs with a broader range of error types and languages, thereby providing a more rigorous test of their debugging capabilities. The paper notes that even the best-performing model, Claude 3.5 Sonnet, struggles with repository-level debugging, particularly as the number of errors increases and the code length grows. This indicates that while LLMs have shown proficiency in function-level debugging, they face substantial challenges in repository-level scenarios, which require understanding the context and structure of larger codebases. The dataset's inclusion of multiple error types, such as syntax, logic, reference, and multiple errors, further tests the models' ability to localize bugs accurately, highlighting the complexity of real-world debugging tasks.
In comparison to function-level debugging, which often involves isolated snippets of code, repository-level debugging requires models to navigate and understand the broader context of a project. This involves not only identifying and locating bugs but also understanding how different parts of the code interact. The paper's evaluation of LLMs on RepoDebug reveals that errors in languages like Java are easier to detect and repair, suggesting that language-specific characteristics may influence debugging difficulty. Overall, RepoDebug provides a more comprehensive and challenging benchmark for evaluating LLMs, emphasizing the need for advancements in their ability to handle complex, repository-level debugging tasks.
Confidence: 0.90
What methodologies are employed in RepoDebug to evaluate the correctness of patches generated by LLMs, and how do these methodologies address different error types?
RepoDebug employs a comprehensive methodology to evaluate the correctness of patches generated by large language models (LLMs) by focusing on a multi-task and multi-language approach. This methodology is particularly significant as it addresses a wide array of error types, which are systematically classified into four main categories: syntax errors, logic errors, reference errors, and multiple errors. The dataset spans 22 distinct subtypes of bugs, providing a robust framework for assessing LLMs' debugging capabilities across different programming languages and tasks. The paper highlights that RepoDebug is constructed using data from 63 GitHub repositories, ensuring a diverse and realistic set of debugging challenges. Each instance in the dataset includes a buggy code file, the subtype of the bug, and its precise location, which is crucial for evaluating the LLMs' ability to identify and repair errors accurately.
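For illustration only, a RepoDebug-style record might look like the following; the field names are assumptions made for this sketch, not the dataset's published schema.

```python
# Hypothetical record layout for one instance (field names are assumed for
# illustration; they are not taken from the released dataset).
example_instance = {
    "repository": "github.com/example/project",   # placeholder repository name
    "language": "Java",                            # one of the 8 supported languages
    "buggy_file": "src/main/java/App.java",        # path to the buggy code file
    "bug_category": "logic",                       # syntax | logic | reference | multiple
    "bug_subtype": "flipped_comparison",           # one of 22 subtypes (name assumed)
    "locations": [[42, 15]],                       # exact (line, column) of each injected bug
    "tasks": ["bug_identification", "bug_localization", "automatic_program_repair"],
}
```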
The evaluation process in RepoDebug is meticulous, involving both automated filtering and manual inspection to ensure the validity of the bugs. This dual approach helps in maintaining the quality and reliability of the dataset, which is essential for accurately assessing the performance of LLMs. The paper notes that the evaluation experiments conducted on various models reveal significant insights: "Existing large language models exhibit limitations in performance on the RepoDebug dataset," particularly when dealing with multiple errors and longer code segments. This finding underscores the complexity of repository-level debugging tasks and the challenges faced by LLMs in such scenarios.
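As a concrete, hypothetical example of what such automated filtering can look like, the check below keeps a syntax-error injection only if the mutated file genuinely fails to parse; it uses Python's built-in `ast` module and is not the paper's actual filter.

```python
# Hypothetical automated-validity check (not the paper's filter): keep a
# syntax-error injection only if the mutated file really fails to parse.
import ast

def injection_is_valid(original_src: str, mutated_src: str) -> bool:
    """A syntax-error injection is valid only if the original parses
    and the mutated version does not."""
    try:
        ast.parse(original_src)
    except SyntaxError:
        return False          # original was already broken; skip it
    try:
        ast.parse(mutated_src)
    except SyntaxError:
        return True           # mutation introduced a real syntax error
    return False              # mutation still parses: not a syntax error

print(injection_is_valid("x = 1\n", "x = = 1\n"))  # -> True
```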
Moreover, the paper emphasizes the use of four metrics to evaluate the LLMs, distinguishing between the success rate of identifying single and multiple error locations. This nuanced approach allows for a detailed analysis of the models' strengths and weaknesses in handling different error types. The results indicate that while syntactic errors are generally easier for LLMs to detect and repair, multiple errors pose a significant challenge, highlighting the need for further advancements in LLM capabilities. Overall, RepoDebug's methodology provides a comprehensive framework for evaluating patch correctness, addressing the diverse and complex nature of errors encountered in real-world software development.
Confidence: 0.90
How does RepoDebug incorporate static and dynamic analysis techniques to enhance the reliability of LLM-generated repairs?
Judging from what the paper reports, RepoDebug supports reliable evaluation of LLM-generated repairs chiefly through static analysis; dynamic analysis is not described explicitly. The dataset is meticulously constructed using abstract syntax trees (ASTs) via the tree-sitter tool, which is a form of static analysis. This approach allows for the precise classification and location of 22 distinct subtypes of bugs across multiple programming languages, ensuring that the dataset captures a wide range of error types, including syntax, logic, and reference errors. By leveraging ASTs, RepoDebug can systematically introduce and validate bugs, which is crucial for evaluating the effectiveness of LLMs in debugging tasks.
Moreover, the paper highlights that RepoDebug supports three debugging tasks: Bug Identification, Bug Localization, and Automatic Program Repair. These tasks inherently require a combination of static and dynamic analysis techniques to be effectively addressed by LLMs. While the paper does not explicitly detail dynamic analysis methods, the nature of these tasks suggests that dynamic analysis could be involved, particularly in the context of Automatic Program Repair, where understanding the runtime behavior of code is often necessary. The evaluation of LLMs on RepoDebug reveals that even the most advanced models struggle with repository-level debugging, especially as the complexity of the errors increases. This finding underscores the importance of integrating robust analysis techniques to enhance the reliability of LLM-generated repairs.
In summary, while RepoDebug primarily utilizes static analysis through ASTs to construct and evaluate its dataset, the tasks it supports imply a need for dynamic analysis as well. This integration of analysis techniques is essential for improving the reliability and effectiveness of LLMs in real-world debugging scenarios, as evidenced by the challenges faced by current models on the dataset.
Confidence: 0.80
What are the limitations identified in current LLMs' ability to handle multiple error locations within a repository, as revealed by the RepoDebug experiments?
The RepoDebug experiments reveal several limitations in the current capabilities of large language models (LLMs) when it comes to handling multiple error locations within a repository. One of the primary challenges identified is the difficulty LLMs face in managing the complexity of repository-level debugging, which involves multiple tasks, languages, and error types. The paper notes that 'even the most advanced models fall short in repository-level debugging, particularly when the number of errors increases and the code length grows.' This indicates that as the complexity of the debugging task increases, the performance of LLMs diminishes, highlighting a significant limitation in their current design and implementation.
Furthermore, the experiments conducted using the RepoDebug dataset, which includes 22 subtypes of bugs across 8 programming languages, underscore the particular difficulty LLMs have with 'multiple errors,' which are identified as the most challenging type compared to syntactic errors, which are the simplest. This suggests that LLMs struggle with the intricacies of identifying and resolving multiple errors simultaneously, which is a common scenario in real-world repository-level debugging. The paper's findings emphasize the need for more sophisticated approaches to improve LLMs' ability to handle such complex error scenarios effectively.
The significance of these findings lies in their implications for the future development of automatic program repair tools. By identifying these limitations, the paper highlights areas where further research and innovation are needed to enhance the robustness and reliability of LLMs in debugging tasks. This understanding is crucial for advancing the field of automatic program repair and ensuring that LLMs can be effectively utilized in practical software development environments.
Confidence: 0.90
📝 Overall Summary
Taken together, RepoDebug is a multi-task, multi-language benchmark for repository-level code debugging: 22 bug subtypes grouped into four categories (syntax, logic, reference, and multiple errors), spanning 8 programming languages and 3 tasks (Bug Identification, Bug Localization, and Automatic Program Repair). It is built from 63 GitHub repositories created after 2022 to limit data leakage, with bugs injected through tree-sitter abstract syntax trees so that each instance records the buggy file, the bug subtype, and its exact location; automated filtering and manual inspection guard the validity of the injected bugs.
Evaluating 10 LLMs (3 open-source, 7 closed-source) with four location-aware metrics shows that even the strongest model, Claude 3.5 Sonnet, falls short at the repository level, and performance degrades further as the number of errors and the code length grow. Difficulty also varies systematically: multiple errors are the hardest and syntax errors the easiest, while errors in Java are comparatively easy to detect and repair. The construction and evaluation pipeline described in the paper is essentially static (AST-based); dynamic analysis is not detailed, although tasks such as Automatic Program Repair would naturally benefit from it. These results delineate where future work is needed: handling multiple co-occurring errors, longer contexts, and the full diversity of languages and error types found in real-world repositories.