The Achilles Heel Hypothesis: Pitfalls for AI Systems via Decision Theoretic Adversaries
As progress in AI continues at a rapid pace, it is crucial to understand how advanced systems will make choices and in what ways they may fail. Machines can already outsmart humans in some domains, and understanding how to safely build systems that may have capabilities at or above the human level is of particular concern. One might suspect that superhumanly intelligent systems should be modeled as something which humans, by definition, cannot outsmart. As a challenge to this assumption, however, this paper presents the Achilles Heel hypothesis, which states that highly effective goal-oriented systems – even potentially superintelligent ones – may nonetheless have stable decision-theoretic delusions that cause them to make obviously irrational decisions in adversarial settings. In a survey of relevant dilemmas and paradoxes from the decision theory literature, a number of these potential Achilles Heels are discussed in the context of this hypothesis. Several novel contributions are made concerning the ways in which these weaknesses could be implanted into a system.