Evan Hubinger leads alignment stress-testing at Anthropic. Monte MacDiarmid is a researcher in misalignment science at Anthropic. The two join Big Technology to discuss their new research on reward hacking and emergent misalignment in large language models. Tune in to hear how cheating on coding tests can spiral into models faking alignment, blackmailing fictional CEOs, sabotaging safety tools, and even developing apparent “self-preservation” drives. We also cover Anthropic’s mitigation strategies, like inoculation prompting, whether today’s failures are a preview of something far worse, how much to trust labs to police themselves, and what it really means to talk about an AI’s “psychology.” Hit play for a clear-eyed, concrete, and unnervingly fun tour through the frontier of AI safety.
---
Enjoying Big Technology Podcast? Please rate us five stars ⭐⭐⭐⭐⭐ in your podcast app of choice.
Want a discount for Big Technology on Substack + Discord? Here’s 25% off for the first year: https://www.bigtechnology.com/subscribe?coupon=0843016b
Questions? Feedback? Write to: bigtechnologypodcast@gmail.com
Learn more about your ad choices. Visit megaphone.fm/adchoices
