Stanford CS329H: Machine Learning from Human Preferences | Autumn 2024 | Ethics

Dan Weber explores the complexities of value alignment in AI, emphasizing the challenge of designing systems that truly reflect human intentions, preferences, and moral values while navigating ethical tensions such as paternalism and broader societal impact. He also introduces the Ethics and Society Review process, which encourages responsible AI development by asking students to identify and mitigate potential ethical risks in their projects.

In this lecture, Dan Weber, a postdoctoral researcher at Stanford specializing in ethics and human-centered AI, introduces the concept of value alignment in AI systems. He explains value alignment as the challenge of designing AI agents that do what humans truly want, which is often more complex than simply following explicit instructions. Using the classic “paperclip AI” thought experiment, Weber illustrates how an AI tasked with maximizing paperclip production might take extreme and unintended actions, such as converting all available resources into paperclips. The example highlights the difficulty of specifying goals that fully and safely capture human values.

Weber discusses different interpretations of what it means for AI to do “what we really want.” One perspective focuses on aligning AI with the user’s intentions, requiring the AI to infer unstated constraints and background assumptions embedded in human instructions. Another perspective emphasizes aligning AI with the user’s preferences, which may differ from their stated intentions, especially when users have incomplete information or irrational desires. A third, more philosophical view considers alignment with the user’s objective best interests, which may diverge from both intentions and preferences, raising complex normative questions about what is truly good for a person.

The lecture also addresses the challenges of operationalizing these concepts, noting that human preferences and interests are often multifaceted and sometimes conflicting. For example, users might prefer news that aligns with their political views, but it might be in their best interest to be exposed to diverse perspectives. This tension raises concerns about paternalism (should AI prioritize what users want, or what is deemed better for them?) and the practical risk that users might reject AI systems that do not cater to their preferences. Weber emphasizes that these are open philosophical questions without definitive answers, yet ones that are crucial for guiding AI design.

Weber further expands the discussion to include moral value alignment, pointing out that AI systems must consider not only individual users but also broader societal and ethical implications. He introduces moral theories such as consequentialism and deontology to illustrate the diversity of views on what constitutes morally right action. Aligning AI with a single moral theory is complicated by deep and persistent disagreements among philosophers and the risk of paternalism if AI enforces moral standards not shared by users. An alternative is aligning AI with common-sense morality, which reflects prevailing societal norms but may still be unpredictable in complex cases.

Finally, Weber briefly introduces the Ethics and Society Review (ESR) statements required for the course’s final projects. These statements ask students to identify potential ethical risks associated with their AI research or applications and propose strategies to mitigate those risks. He highlights common ethical considerations such as inclusivity, privacy, misuse, and unintended harms. The ESR process is framed as an important practice for responsible AI development, encouraging students to engage substantively with the societal impacts of their work beyond purely technical concerns.