Building Human-AI Alignment: Specifying, Inspecting, and Modeling AI Behaviors | Serena Booth
GPS, Robinson Building Complex (RBC), 3106
Abstract: The learned behaviors of AI and robot agents should align with the intentions of their human designers. Alignment is necessary for AI systems to be used in many sectors of the economy, so the process of aligning AI systems is critical to study when defining effective AI policy. Toward this goal, people must be able to easily specify, inspect, and model agent behaviors. For specification, we will consider expert-written reward functions for reinforcement learning (RL) and non-expert preferences for reinforcement learning from human feedback (RLHF). I will show evidence that experts are bad at writing reward functions: even in a trivial setting, experts write specifications that are overfit to a particular RL algorithm, and they often write erroneous specifications that fail to encode their true intent. I will also show that the common approach to learning a reward function from non-experts in RLHF uses an inductive bias that fails to encode how humans express preferences, and that our proposed bias better encodes human preferences both theoretically and empirically. I will discuss the policy implications: namely, that engineers' design processes and embedded assumptions in building AI must be considered. For inspection, humans must be able to assess the behaviors an agent learns from a given specification. I will discuss a method to find settings that exhibit particular behaviors, like out-of-distribution failures. I will discuss the policy implications for testing AI systems, for example through red teaming. Lastly, for modeling, cognitive science theories attempt to show how people build conceptual models that explain agent behaviors. I will show evidence that some of these theories are used in research to support humans, but that we can still build better curricula for modeling. I will discuss the policy need for careful onboarding to AI systems. I will end by discussing my current work in the U.S. Senate on responding to the proliferation of AI. Collectively, my research provides evidence that, even with the best of intentions, current human-AI systems often fail to induce alignment, and it proposes promising directions for building better-aligned human-AI systems.