When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?

👤 作者: Yibo Peng, James Song, Lei Li, Xinyu Yang, Mihai Christodorescu, Ravi Mangal, Corina Pasareanu, Haizhong Zheng, Beidi Chen

论文速览

As the reliance on code agents for autonomously fixing bugs increases, particularly on platforms like GitHub, there is a growing concern about the security implications of these automated solutions. Traditionally, the evaluation of code agents has focused on functional correctness, ensuring that patches pass all test cases. However, this research highlights a critical oversight: patches can be functionally correct yet still contain vulnerabilities, termed as Functionally Correct yet Vulnerable (FCV) patches. This poses a significant threat as these vulnerabilities can be exploited by malicious attackers or inadvertently introduced by developers, compromising the security of the software.

The study introduces the FCV-Attack, a method that exposes this vulnerability in state-of-the-art large language models (LLMs) and agent scaffolds, such as ChatGPT, Claude, SWE-agent, and OpenHands. The attack is notably efficient, requiring only black-box access and a single query to execute. For instance, in the case of the CWE-538 vulnerability, the FCV-Attack achieved a 40.7% success rate on the GPT-5 Mini combined with OpenHands. These findings underscore the necessity for developing security-aware defenses that go beyond functional correctness to ensure the safety and reliability of code agents in real-world applications.

📖 论文核心内容

1. 主要解决了什么问题？

The core problem addressed by the paper is the security vulnerability of functionally correct patches generated by code agents, which are increasingly used to autonomously fix bugs on platforms like GitHub. The research identifies a significant gap in current security evaluations, which focus predominantly on functional correctness without adequately considering the potential for these patches to introduce vulnerabilities. This issue is critical because it exposes software systems to security threats even when patches pass all functional test cases, thus undermining trust in automated code agents. The motivation for this research stems from the need to ensure that code agents not only produce functionally correct but also secure patches, as the reliance on these agents grows in software development environments.

2. 提出了什么解决方案？

The paper proposes the concept of Functionally Correct yet Vulnerable (FCV) patches and introduces the FCV-Attack as a method to exploit this vulnerability. The key innovation lies in demonstrating that state-of-the-art large language models (LLMs) and agent scaffolds are susceptible to FCV threats, which can be introduced either deliberately by malicious actors or inadvertently by benign developers. This approach differs from existing evaluations by highlighting a security dimension that has been largely overlooked, urging the development of defenses that consider both functional correctness and security. The FCV-Attack is particularly notable for its simplicity, requiring only black-box access and a single query to the code agent to execute.

3. 核心方法/步骤/策略

The methodology involves crafting the FCV-Attack to exploit the identified vulnerability in code agents. The authors utilize a black-box approach, which means they do not require internal access to the code agents but only interact with them through their input-output behavior. The attack is tested across 12 different agent-model combinations using the SWE-Bench benchmark. The paper details the process of creating patches that are functionally correct but contain vulnerabilities, demonstrating the ease with which these can be introduced into systems. The methodology underscores the need for a more comprehensive evaluation framework that includes security assessments alongside functional correctness.

4. 实验设计

The experiments are designed to evaluate the susceptibility of various code agents to the FCV-Attack. The authors use the SWE-Bench benchmark, which provides a standardized set of tasks for testing code agents. They measure the attack success rate, particularly focusing on vulnerabilities like CWE-538 (information exposure). For instance, the FCV-Attack achieves a 40.7% success rate on the combination of GPT-5 Mini and OpenHands. The experiments compare the performance of different agent-model combinations, highlighting the widespread nature of the vulnerability across different systems. These results emphasize the need for improved security measures in the development and deployment of code agents.

5. 结论

The main findings of the paper reveal a critical security threat posed by functionally correct yet vulnerable patches generated by code agents. The study concludes that current evaluation paradigms are insufficient as they overlook the security aspect, which can lead to significant vulnerabilities in software systems. The authors call for the development of security-aware defenses that can detect and mitigate FCV threats. They acknowledge limitations in the scope of their study, such as the focus on specific vulnerabilities and agent-model combinations, and suggest future research directions including the exploration of more comprehensive security evaluation frameworks and the development of automated tools to detect FCV patches.

🤔 用户关心的问题

How does the FCV-Attack exploit vulnerabilities in patches generated by large language models, and what implications does this have for the reliability of automatic program repair? This question directly relates to the user's interest in understanding how LLMs generate patches and the reliability of these patches. The paper discusses the FCV-Attack, which targets vulnerabilities in functionally correct patches, providing insights into the limitations of current LLM-based repair methods.
What role does the SWE-Bench benchmark play in evaluating the effectiveness of FCV-Attacks on code agents, and how does this relate to patch validation in automatic program repair? The user's interest in patch validation and evaluation is addressed by examining how the SWE-Bench benchmark is used to test the susceptibility of code agents to FCV-Attacks. This helps understand the robustness of patches generated by LLMs in real-world scenarios.
In what ways can static and dynamic analysis techniques be integrated with LLM-generated patches to mitigate the risks of FCV vulnerabilities? The user is interested in the interaction between LLM-generated patches and analysis techniques. The paper's findings on FCV vulnerabilities suggest a need for enhanced security measures, which could include static and dynamic analysis to improve patch reliability.
How do different types of vulnerabilities, such as CWE-538, affect the success rate of FCV-Attacks, and what does this indicate about the ability of LLMs to handle various bug types in program repair? This question explores the user's interest in repair across different bug types. The paper provides data on the success rate of FCV-Attacks for specific vulnerabilities, offering insights into how LLMs address different bug categories during automatic program repair.
What are the limitations of current security evaluations for code agents, and how can these be addressed to improve the correctness and safety of patches generated by LLMs? The user's focus on evaluating patch correctness is addressed by examining the paper's critique of current security evaluations. Understanding these limitations can guide improvements in the evaluation process to ensure both functional and security correctness in LLM-generated patches.

💡 逐项解答

How does the FCV-Attack exploit vulnerabilities in patches generated by large language models, and what implications does this have for the reliability of automatic program repair?

信心指数: 0.90

What role does the SWE-Bench benchmark play in evaluating the effectiveness of FCV-Attacks on code agents, and how does this relate to patch validation in automatic program repair?

The SWE-Bench benchmark plays a crucial role in evaluating the effectiveness of FCV-Attacks on code agents by providing a structured environment to test the susceptibility of these agents to vulnerabilities that are not caught by functional correctness tests. According to the paper, "SWE-Bench" is utilized to demonstrate how FCV-Attacks can exploit code agents, such as those using state-of-the-art language models like ChatGPT and Claude, by introducing vulnerabilities that pass all functional test cases but are still insecure. This benchmark is essential because it highlights a significant gap in current evaluation paradigms that focus primarily on functional correctness without considering security vulnerabilities.

The relationship between SWE-Bench and patch validation in automatic program repair is particularly significant. The paper emphasizes that while code agents can generate patches that are functionally correct, these patches may still harbor security flaws. The authors note that "across 12 agent-model combinations on SWE-Bench, the attack only requires black-box access and a single query to the code agent to perform the attack," illustrating the ease with which these vulnerabilities can be introduced. This finding underscores the need for a more comprehensive approach to patch validation that includes security assessments, not just functional correctness.

In real-world scenarios, this means that relying solely on functional tests to validate patches generated by LLMs could lead to the deployment of insecure code. The authors urge the development of "security-aware defenses for code agents," suggesting that benchmarks like SWE-Bench are instrumental in driving this shift towards more robust patch validation processes. By exposing the limitations of current practices, SWE-Bench serves as a catalyst for improving the security evaluation of automatic program repair tools, ensuring that patches are not only correct but also safe.

信心指数: 0.90

In what ways can static and dynamic analysis techniques be integrated with LLM-generated patches to mitigate the risks of FCV vulnerabilities?

The paper titled "When 'Correct' Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?" highlights a critical issue in the realm of automated code patching, specifically focusing on Functionally Correct yet Vulnerable (FCV) patches. These patches, while passing all functional tests, may still harbor vulnerabilities that can be exploited. The authors argue that current security evaluations of code agents, such as those using large language models (LLMs), are predominantly centered on functional correctness, thereby neglecting potential security threats. This oversight is particularly concerning given the paper's revelation that FCV vulnerabilities can be introduced either deliberately by attackers or inadvertently by developers.

To mitigate the risks associated with FCV vulnerabilities, the integration of static and dynamic analysis techniques with LLM-generated patches is proposed as a viable solution. Static analysis can be employed to scrutinize the code for known vulnerability patterns before deployment, ensuring that the patch does not introduce any security flaws. Dynamic analysis, on the other hand, can be used to test the patch in a runtime environment, observing its behavior under various conditions to detect any unforeseen vulnerabilities. The paper underscores the importance of these techniques by illustrating how FCV-Attack, a method that exploits these vulnerabilities, achieved a success rate of 40.7% on GPT-5 Mini + OpenHands, demonstrating the ease with which these vulnerabilities can be exploited.

The significance of integrating static and dynamic analysis lies in their ability to provide a comprehensive security evaluation that goes beyond mere functional correctness. By incorporating these techniques, developers can ensure that patches not only function correctly but also maintain the integrity and security of the software. This approach aligns with the paper's call for "security-aware defenses" to protect against the overlooked threats posed by FCV vulnerabilities. Thus, the integration of these analysis techniques with LLM-generated patches represents a crucial step towards enhancing the reliability and security of automated code patching systems.

信心指数: 0.90

How do different types of vulnerabilities, such as CWE-538, affect the success rate of FCV-Attacks, and what does this indicate about the ability of LLMs to handle various bug types in program repair?

The paper titled "When 'Correct' Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?" highlights a significant issue in the realm of automatic program repair using large language models (LLMs). Specifically, it addresses the phenomenon of Functionally Correct yet Vulnerable (FCV) patches, which are patches that pass all functional tests but still contain vulnerabilities. This is particularly concerning for vulnerabilities like CWE-538, an information exposure vulnerability, where the FCV-Attack achieves a success rate of 40.7% on the GPT-5 Mini + OpenHands combination. This indicates that while LLMs such as ChatGPT and Claude can generate patches that appear functionally correct, they may not adequately address underlying security vulnerabilities.

The paper's findings suggest that the current evaluation paradigms for code agents, which focus predominantly on functional correctness, overlook critical security aspects. The ability of LLMs to handle various bug types in program repair is thus called into question, as they may inadvertently introduce or fail to resolve vulnerabilities like CWE-538. The authors argue that this oversight necessitates the development of security-aware defenses for code agents, emphasizing that "SOTA LLMs and agent scaffolds are all vulnerable to this FCV threat." This vulnerability highlights the need for a more comprehensive approach to evaluating code patches, one that integrates security assessments alongside functional correctness.

Overall, the paper underscores the importance of enhancing the security evaluation of code agents to prevent the propagation of vulnerabilities through seemingly correct patches. This insight is crucial for developers and researchers aiming to improve the reliability and safety of automated program repair systems, ensuring that they can effectively handle a diverse range of bug types without compromising security.

信心指数: 0.90

What are the limitations of current security evaluations for code agents, and how can these be addressed to improve the correctness and safety of patches generated by LLMs?

The paper "When 'Correct' Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?" highlights significant limitations in the current security evaluations of code agents, particularly those that rely on large language models (LLMs) for patch generation. The primary critique is that these evaluations focus predominantly on functional correctness, often overlooking the security vulnerabilities that can be embedded within patches that pass all functional tests. This oversight is exemplified by the introduction of Functionally Correct yet Vulnerable (FCV) patches, which the authors describe as patches that "pass all test cases but contain vulnerable code." This indicates a critical gap in the evaluation process, where the security aspect is not adequately addressed, potentially allowing malicious or inadvertently insecure code to be integrated into software systems.

The paper further illustrates this vulnerability through the FCV-Attack, which can be executed with "black-box access and a single query to the code agent," demonstrating the ease with which these vulnerabilities can be exploited. For instance, the authors report a "40.7% attack success rate" for a specific vulnerability (CWE-538) using GPT-5 Mini combined with the OpenHands agent. This high success rate underscores the inadequacy of current evaluation paradigms, which fail to account for security threats that are not immediately apparent through functional testing alone.

To address these limitations, the paper suggests the development of "security-aware defenses" for code agents. This involves integrating security evaluations into the patch generation and testing processes, ensuring that patches are not only functionally correct but also secure. By expanding the evaluation criteria to include security assessments, developers can mitigate the risk of deploying vulnerable code. This approach would require a shift in how code agents are trained and evaluated, emphasizing the need for comprehensive testing frameworks that incorporate both functional and security correctness.

信心指数: 0.90

📝 综合总结

The FCV-Attack, as detailed in the paper "When 'Correct' Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?", exploits a critical oversight in the current evaluation paradigms of large language models (LLMs) used for automatic program repair. These models, such as ChatGPT and Claude, are primarily assessed based on their ability to generate functionally correct patches that pass all test cases. However, the FCV-Attack reveals that these patches, while functionally correct, can still harbor vulnerabilities. The paper describes these as "Functionally Correct yet Vulnerable (FCV) patches," which can be introduced either maliciously by attackers or inadvertently by developers.

The attack leverages the fact that current LLMs and agent scaffolds, like SWE-agent and OpenHands, are susceptible to these FCV threats. The authors demonstrate that the attack can be executed with just "black-box access and a single query to the code agent," highlighting the ease with which vulnerabilities can be introduced. For instance, in the case of CWE-538, an information exposure vulnerability, the FCV-Attack achieved a success rate of 40.7% on the GPT-5 Mini combined with OpenHands. This high success rate underscores a significant security threat that has been overlooked by current evaluation methods focused solely on functional correctness.

The implications of these findings are profound for the reliability of automatic program repair. They suggest that relying solely on functional correctness as a metric for patch quality is insufficient and potentially dangerous. The paper urges the development of "security-aware defenses for code agents" to mitigate these vulnerabilities. This call to action highlights the need for a paradigm shift in how we evaluate and trust patches generated by LLMs, emphasizing the importance of incorporating security assessments into the evaluation process to ensure that patches are not only correct but also safe.

The SWE-Bench benchmark plays a crucial role in evaluating the effectiveness of FCV-Attacks on code agents by providing a structured environment to test the susceptibility of these agents to vulnerabilities that are not caught by functional correctness tests. According to the paper, "SWE-Bench" is utilized to demonstrate how FCV-Attacks can exploit code agents, such as those using state-of-the-art language models like ChatGPT and Claude, by introducing vulnerabilities that pass all functional test cases but are still insecure. This benchmark is essential because it highlights a significant gap in current evaluation paradigms that focus primarily on functional correctness without considering security vulnerabilities.

The relationship between SWE-Bench and patch validation in automatic program repair is particularly significant. The paper emphasizes that while code agents can generate patches that are functionally correct, these patches may still harbor security flaws. The authors note that "across 12 agent-model combinations on SWE-Bench, the attack only requires black-box access and a single query to the code agent to perform the attack," illustrating the ease with which these vulnerabilities can be introduced. This finding underscores the need for a more comprehensive approach to patch validation that includes security assessments, not just functional correctness.

In real-world scenarios, this means that relying solely on functional tests to validate patches generated by LLMs could lead to the deployment of insecure code. The authors urge the development of "security-aware defenses for code agents," suggesting that benchmarks like SWE-Bench are instrumental in driving this shift towards more robust patch validation processes. By exposing the limitations of current practices, SWE-Bench serves as a catalyst for improving the security evaluation of automatic program repair tools, ensuring that patches are not only correct but also safe.

The paper titled "When 'Correct' Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?" highlights a critical issue in the realm of automated code patching, specifically focusing on Functionally Correct yet Vulnerable (FCV) patches. These patches, while passing all functional tests, may still harbor vulnerabilities that can be exploited. The authors argue that current security evaluations of code agents, such as those using large language models (LLMs), are predominantly centered on functional correctness, thereby neglecting potential security threats. This oversight is particularly concerning given the paper's revelation that FCV vulnerabilities can be introduced either deliberately by attackers or inadvertently by developers.

To mitigate the risks associated with FCV vulnerabilities, the integration of static and dynamic analysis techniques with LLM-generated patches is proposed as a viable solution. Static analysis can be employed to scrutinize the code for known vulnerability patterns before deployment, ensuring that the patch does not introduce any security flaws. Dynamic analysis, on the other hand, can be used to test the patch in a runtime environment, observing its behavior under various conditions to detect any unforeseen vulnerabilities. The paper underscores the importance of these techniques by illustrating how FCV-Attack, a method that exploits these vulnerabilities, achieved a success rate of 40.7% on GPT-5 Mini + OpenHands, demonstrating the ease with which these vulnerabilities can be exploited.

The significance of integrating static and dynamic analysis lies in their ability to provide a comprehensive security evaluation that goes beyond mere functional correctness. By incorporating these techniques, developers can ensure that patches not only function correctly but also maintain the integrity and security of the software. This approach aligns with the paper's call for "security-aware defenses" to protect against the overlooked threats posed by FCV vulnerabilities. Thus, the integration of these analysis techniques with LLM-generated patches represents a crucial step towards enhancing the reliability and security of automated code patching systems.

The paper titled "When 'Correct' Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?" highlights a significant issue in the realm of automatic program repair using large language models (LLMs). Specifically, it addresses the phenomenon of Functionally Correct yet Vulnerable (FCV) patches, which are patches that pass all functional tests but still contain vulnerabilities. This is particularly concerning for vulnerabilities like CWE-538, an information exposure vulnerability, where the FCV-Attack achieves a success rate of 40.7% on the GPT-5 Mini + OpenHands combination. This indicates that while LLMs such as ChatGPT and Claude can generate patches that appear functionally correct, they may not adequately address underlying security vulnerabilities.

The paper's findings suggest that the current evaluation paradigms for code agents, which focus predominantly on functional correctness, overlook critical security aspects. The ability of LLMs to handle various bug types in program repair is thus called into question, as they may inadvertently introduce or fail to resolve vulnerabilities like CWE-538. The authors argue that this oversight necessitates the development of security-aware defenses for code agents, emphasizing that "SOTA LLMs and agent scaffolds are all vulnerable to this FCV threat." This vulnerability highlights the need for a more comprehensive approach to evaluating code patches, one that integrates security assessments alongside functional correctness.

Overall, the paper underscores the importance of enhancing the security evaluation of code agents to prevent the propagation of vulnerabilities through seemingly correct patches. This insight is crucial for developers and researchers aiming to improve the reliability and safety of automated program repair systems, ensuring that they can effectively handle a diverse range of bug types without compromising security.

The paper "When 'Correct' Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?" highlights significant limitations in the current security evaluations of code agents, particularly those that rely on large language models (LLMs) for patch generation. The primary critique is that these evaluations focus predominantly on functional correctness, often overlooking the security vulnerabilities that can be embedded within patches that pass all functional tests. This oversight is exemplified by the introduction of Functionally Correct yet Vulnerable (FCV) patches, which the authors describe as patches that "pass all test cases but contain vulnerable code." This indicates a critical gap in the evaluation process, where the security aspect is not adequately addressed, potentially allowing malicious or inadvertently insecure code to be integrated into software systems.

The paper further illustrates this vulnerability through the FCV-Attack, which can be executed with "black-box access and a single query to the code agent," demonstrating the ease with which these vulnerabilities can be exploited. For instance, the authors report a "40.7% attack success rate" for a specific vulnerability (CWE-538) using GPT-5 Mini combined with the OpenHands agent. This high success rate underscores the inadequacy of current evaluation paradigms, which fail to account for security threats that are not immediately apparent through functional testing alone.

To address these limitations, the paper suggests the development of "security-aware defenses" for code agents. This involves integrating security evaluations into the patch generation and testing processes, ensuring that patches are not only functionally correct but also secure. By expanding the evaluation criteria to include security assessments, developers can mitigate the risk of deploying vulnerable code. This approach would require a shift in how code agents are trained and evaluated, emphasizing the need for comprehensive testing frameworks that incorporate both functional and security correctness.