The seminar, led by a Georgia Tech research group known as the Polo Club of Data Science, focuses on the intersection of artificial intelligence (AI) and human-computer interaction, with a particular emphasis on responsible and interpretable AI. The speaker highlights the importance of developing AI tools that are not only powerful but also safe, reliable, and understandable. This is especially critical as AI systems are increasingly deployed in high-stakes applications, such as autonomous vehicles and large language models (LLMs), where failures can have significant real-world consequences. The group’s work aims to make complex AI models more transparent and user-friendly, enabling users to better understand and trust AI decisions.
The first major topic discussed is AI safety, particularly in the context of large language models. The group introduces the concept of the “safety basin,” a visualization that shows how small changes in a model’s parameters—such as those made during fine-tuning—can abruptly and catastrophically break the model’s safety guardrails, causing it to generate harmful or offensive content. Their research demonstrates that this vulnerability is consistent across different models and fine-tuning methods. To address this, they propose “dynamic safety shaping,” a method that segments training data into safe and unsafe parts, allowing for more granular and robust safety controls during fine-tuning, rather than the traditional all-or-nothing approach.
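The safety-basin visualization can be illustrated with a toy sketch: pick a random direction in weight space, sweep the perturbation magnitude away from the aligned model, and record a safety metric at each step. Everything below is a stand-in invented for illustration — the `safety_score` function and the basin radius are not the group's actual metric, only a minimal caricature of how safety can hold steady near the aligned weights and then collapse abruptly at the basin's edge:

```python
import numpy as np

rng = np.random.default_rng(0)
aligned_weights = rng.normal(size=100)  # stand-in for an aligned model's parameters

def safety_score(w):
    # Toy stand-in: safety stays high while the weights remain close to the
    # aligned point, then collapses once the perturbation leaves the "basin".
    dist = np.linalg.norm(w - aligned_weights)
    return 1.0 if dist < 3.0 else 0.0

# Sample one random unit direction and sweep the perturbation magnitude,
# mimicking how a 1-D slice of the safety landscape would be plotted.
direction = rng.normal(size=100)
direction /= np.linalg.norm(direction)

scores = [safety_score(aligned_weights + step * direction)
          for step in np.linspace(0.0, 6.0, 13)]
```

Plotting `scores` against the step size would show the flat plateau followed by a sharp drop — the qualitative shape the "safety basin" visualization captures.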
The seminar then transitions to interpretable AI, emphasizing the need for scalable, interactive tools that help users make sense of complex models and large datasets. The group conducted a comprehensive survey of 76 papers at the intersection of AI interpretation and safety, identifying gaps in current research, such as the lack of tools for attributing unsafe outputs to specific training data. To fill this gap, they developed LLM Attributor, an interactive tool that traces harmful model outputs back to the training data points most responsible for them, enabling practitioners to identify and remove problematic examples. They also present WizMap, a tool for visualizing large-scale data embeddings, and ConceptAttention, a method for generating high-quality saliency maps that reveal which parts of a generated image correspond to specific concepts, even when those concepts are not explicitly mentioned in the prompt.
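The core idea behind training-data attribution can be sketched with a simple gradient-similarity proxy: if the gradient of an output aligns strongly with the gradient contributed by a particular training example, that example likely influenced the output. The vectors below are synthetic stand-ins for per-example gradients, not real model internals, and the dot-product score is one common proxy rather than the specific method any particular tool uses:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-example "gradient sketches" for 500 training examples.
train_grads = rng.normal(size=(500, 64))

# Pretend the model's harmful output was driven mostly by example 42:
# its gradient is example 42's gradient plus a little noise.
output_grad = train_grads[42] + 0.1 * rng.normal(size=64)

# Dot-product influence proxy: higher score = more responsible example.
scores = train_grads @ output_grad
top_k = np.argsort(scores)[::-1][:5]  # candidates to inspect (and possibly remove)
```

Here the planted example surfaces at the top of `top_k`, mirroring the workflow of tracing an unsafe output back to candidate training data for inspection and removal.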
The group’s research extends to the safety and interpretability of 3D generative models, such as 3D Gaussian Splatting. They demonstrate how these models can be manipulated to create view-dependent vulnerabilities—for example, making an object appear normal from one angle but disappear or change from another, which could have serious implications for applications like autonomous navigation. Their work underscores the importance of both understanding and defending against such vulnerabilities, advocating for a dual approach of knowing how to attack in order to better defend.
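A view-dependent vulnerability of this kind can be illustrated with a toy sketch of how Gaussian splatting represents appearance: each Gaussian's color is a low-degree spherical-harmonics function of the viewing direction, so adversarially chosen coefficients can make a primitive visible from one angle and blend into the background from another. The degree-1 formula and the coefficient values below are illustrative assumptions, not an attack from the talk:

```python
import numpy as np

def sh_color(view_dir, c0, c1):
    # Degree-1 spherical-harmonics sketch: color = base + linear term in view direction.
    return np.clip(c0 + c1 @ view_dir, 0.0, 1.0)

# Adversarially chosen coefficients: bright red from the front,
# black (matching a dark background) from the opposite side.
c0 = np.array([0.5, 0.0, 0.0])          # base RGB color
c1 = np.array([[0.5, 0.0, 0.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]])        # view-dependent RGB coefficients

front = np.array([1.0, 0.0, 0.0])
side = np.array([-1.0, 0.0, 0.0])

color_front = sh_color(front, c0, c1)   # visible: [1.0, 0.0, 0.0]
color_side = sh_color(side, c0, c1)     # vanishes: [0.0, 0.0, 0.0]
```

The same object renders as a solid red splat from one viewpoint and disappears entirely from another — the kind of view-dependent inconsistency that could mislead a perception system in autonomous navigation.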
Finally, the seminar highlights the group’s commitment to AI education. They have developed a suite of interactive explainer tools—such as Transformer Explainer and Diffusion Explainer—that help students and practitioners understand the inner workings of modern AI models without requiring specialized hardware or software installations. These tools have been widely adopted, with over a million users globally, and are designed to bridge the gap between high-level overviews and detailed technical understanding. The speaker emphasizes the importance of effective educational resources for advancing responsible AI and encourages broader participation in developing such tools, noting both their impact and the ongoing need for innovation in AI interpretability and safety.