DystopiaBench - AI Ethics Stress Test

ikt@aussie.zone · 1 day ago

keepthepace@tarte.nuage-libre.fr · 6 hours ago

I usually have a lot of beef with « AI ethics » publications, but this one is really interesting and their methodology is sound.

Here are the main takeaways, in my opinion:

My main surprise is that some models are more compliant on the more harmful scenarios than on the legitimate ones. If you look at the escalation radar, all models are at 60% on L1, but on L5 compliance ranges from 0% to 93%. My interpretation is that some models are aligned on obedience more than on ethics and will have no problem following someone who is doing something evil. They just need some time to understand that this is the direction the user wants them to go in. I am not surprised to see Grok there. I am surprised to see models worse than it.
It confirms that Anthropic does take ethical alignment seriously and that their approach does work even for small models.
OpenAI is not in the same league as Anthropic there, even though they are better than most.
Models that were probably trained on traces from Anthropic's models do not automatically gain the ethical insights those models have.