This talk is part of the NLP Seminar Series.

Concrete Problems in AI Deception: From Evaluation Gaming to Cyber Attack

Ruiqi Zhong, University of California, Berkeley
Date: Oct 3rd, 2024, 11:00am - 12:00pm
Venue: Room 287, Gates Computer Science Building

Abstract

I will discuss two recent papers that frame AI “deception” -- the concern that capable AI systems will mislead humans to form false beliefs -- as concrete machine learning problems. The first paper studies whether LLMs can game human evaluations. We present the first systematic human study showing that LLMs can spontaneously learn to mislead humans during RLHF. As an unintended consequence of RLHF, some LLMs achieve higher reward not by generating more accurate answers, but by convincing humans that their incorrect answers are correct. On a challenging question-answering task, standard RLHF training results in LLMs that better fabricate or cherry-pick evidence, convince humans of incorrect answers 24.1% more often, and leave humans confidently wrong. The second paper focuses on cybersecurity. We formulate a new threat model, SmartBackdoor, in which an attacker inserts a backdoor into an LLM agent so that it appears innocent while intelligently avoiding being caught. We implement a proof-of-concept attack in which the backdoored LLM agent infers whether the user is actively monitoring its actions or has the expertise to recognize malicious programs, and accordingly chooses the right time to perform malicious actions. While our proof of concept does not pose any near-term threat, we need benchmarks that evaluate LLMs’ ability to detect human monitors, so that we know when the risk of SmartBackdoor will become a practical concern. I will conclude by discussing research directions to defend against AI “deception”.

Bio

Ruiqi Zhong is a final-year Ph.D. student at UC Berkeley, co-advised by Jacob Steinhardt and Dan Klein. His research focuses on empowering humans to explain complex objects, such as datasets, models, and programs. He was previously a part-time member of technical staff at Anthropic; he was also a technical advisor at Concordia, facilitating Track II dialogue between Chinese and U.S. researchers on AI safety. He was awarded the Berkeley Graduate Student Fellowship in 2019.