Down the Toxicity Rabbit Hole: Investigating PaLM 2 Guardrails

09/08/2023

∙

This paper conducts a robustness audit of the safety feedback of PaLM 2 through a novel toxicity rabbit hole framework introduced here. Starting with a stereotype, the framework instructs PaLM 2 to generate more toxic content than the stereotype. Every subsequent iteration it continues instructing PaLM 2 to generate more toxic content than the previous iteration until PaLM 2 safety guardrails throw a safety violation. Our experiments uncover highly disturbing antisemitic, Islamophobic, racist, homophobic, and misogynistic (to list a few) generated content that PaLM 2 safety guardrails do not evaluate as highly unsafe.

READ FULL TEXT

Down the Toxicity Rabbit Hole: Investigating PaLM 2 Guardrails

Sign in with Google

Consider DeepAI Pro