Sana Pandey

sun-AH pahn-day · Cambridge, MA · she/her

I'm interested in building safer, more robust AI systems as an end-to-end process. My work spans interpretability, multi-objective optimization, model evaluation, benchmarking, and AI policy.

Currently building scalable interpretability methods at MIT CSAIL.


Safe AI & Evaluations

I believe that a model is only as good as its impact; I've tied my theoretical research work to ground-level model assessments for public-facing AI systems. Most recently, I advised the City of Boston on integrating AI into its citizen support systems, and helped the city evaluate enterprise models for its sensitive use cases.

Recommendation Systems

Humans are complex, and we sometimes give mixed signals. My work at UC Berkeley on recommendation systems was supervised by Jonathan Stray and focused on value alignment through ranking beyond engagement. Over my three years of work in this domain, I've organized two industry workshops, published two papers, and built out infrastructure for CHAI's Prosocial Ranking Challenge, which reversed polarization effects by as much as two years. My work was also featured on CBS News.

Interpretability

As a graduate student at MIT CSAIL, I am advised by Jacob Andreas and have collaborated with Antonio Torralba and Ev Fedorenko. My work has focused on building scalable interpretability methods through scientific agents, and predicting generalization failures through interpretability techniques.

AI Policy

I analyzed the impact of China's AI censorship on the East Asian feminist movement through the Stanford China Scholars Program. I was also featured by the World Economic Forum for work that examined the impact of AI on developing economies in the Global South through the University of Pennsylvania's Think Tanks and Civil Societies Fellowship. During the year before graduate school, I co-sponsored San Jose's annual GovAI summit, which brought together more than 300 industry practitioners and governance bodies for safe and effective AI applications.

LLM Benchmarking

At UC Berkeley's Center for Human-Compatible AI, I worked on the StrongREJECT benchmark, which defined the willingness-capabilities tradeoff currently documented in large language models while evaluating jailbreak effectiveness. This work was published at NeurIPS and ICLR, and featured in the subsequent OpenAI model release.

Clinical AI

I believe healthcare is a critical domain for AI. I worked with the UCSF Center for Computational Precision Health to make clinical foundation models more robust and effective in predicting sequences of treatment decisions. This work has now been deployed across the California hospital network.

Natural Language Processing

I worked on applications of graph neural networks in token prediction at Apple, and constructed a new pipeline from scratch for sentiment clustering at Woebot Health. NLP was my first love :).

01 The map

Research Directions

02 Intro

About Me

Hi, I'm Sana! I'm currently a first-year graduate student at MIT CSAIL in EECS and Technology Policy, and a recent graduate of UC Berkeley with degrees in computer science and cognitive science. I also minored in Mandarin, and co-captained the UC Berkeley Fencing team. I am supported by the NSF Graduate Research Fellowship.

Right now, my focus is on building interpretability methods that work at scale and predicting generalization failures through interpretability techniques, but I'm broadly interested in leveraging human data to build safer, more robust AI systems. I've previously worked on value alignment in recommendation systems, LLM benchmarking, clinical AI, natural language processing, and technology policy. Feel free to read more on my resume :)

Affiliated with MIT CSAIL · Language & Intelligence Group · BAIR · CHAI · NSF GRFP

03 Lately

Recent Updates

2026

April
I was selected as one of seven graduate students for Anthropic's inaugural AI x Social Impact Fellows cohort.
March
"Pitfalls in Evaluating Interpretability Agents" (my first work with MIT CSAIL) was submitted to COLM 2026.
February
I was accepted into Harvard's AI Safety Policy Fellowship, run by Harvard AISST.
January
I started collaborating with Andrea de Varda and Ev Fedorenko on predicting generalization failures through mechanistic interpretability.

2025

November
I co-organized Long-term Con, a follow-up to my 2023 workshop Non-Engagement Con, as a workshop at the MIT Conference on Digital Experimentation. The workshop brought together industry and platform experts to discuss methods for measuring causal effects in long-term experiments.
October
I joined the MIT Science Policy Review as an associate editor.
September
I joined MIT CSAIL and started work with Jacob Andreas and Antonio Torralba on scalable interpretability methods.
June
I was awarded the National Science Foundation Graduate Research Fellowship! The fellowship funds graduate studies and a stipend for three years, and will support my work at MIT.
April
I accepted an offer from MIT for a fully-funded graduate program! I'll be pursuing a dual degree in EECS and technology policy, with a focus on safe AI evaluations and development.
February
I was invited to the inaugural International Association for Safe and Ethical AI (IASEAI) Conference, hosted at the OECD headquarters in Paris! Recap of day 1.

2024

December
I co-sponsored the inaugural GovAI Summit through my work at Hortus in San Jose, California.
September
"A StrongREJECT for Empty Jailbreaks" was accepted to NeurIPS. See publications.
August
I was invited as a guest speaker at UC Berkeley's course AI and the Future of Business.
July
I helped set up the CHAI Prosocial Ranking Challenge, contributing testing infrastructure, example code, and recruitment.
May
I graduated from UC Berkeley! Feature profile by the Department of Data Science, Statistics, and Society.
April
I traveled to Virginia for USA Collegiate Fencing Nationals and ranked in the top 10 for Women's Epee.
March
"A StrongREJECT for Empty Jailbreaks" was accepted at the ICLR Workshop on Reliable and Responsible Foundation Models.
February
I co-authored a whitepaper on feedback signals beyond engagement in recommendation systems. See publications.

2023

November
I co-organized Non-Engagement Con, an industry-academia workshop on platform experiments and recommendation output.
September
I developed sequence-to-sequence prediction models for a collaboration between the UCSF Center for Computational Precision Health and Berkeley AI Research.
August
I returned from eight months studying abroad in Italy, earning a beginner's certification in Italian and taking Master's courses in AI, Technology and Society, and Advanced Social Neuroscience.