Counterfactual Planning in AGI Systems

01/29/2021
by Koen Holtman, et al.

We present counterfactual planning as a design approach for creating a range of safety mechanisms that can be applied to hypothetical future AI systems which have Artificial General Intelligence. The key step in counterfactual planning is to use an AGI machine learning system to construct a counterfactual world model, designed to be different from the real world the system is in. A counterfactual planning agent determines the action that maximizes expected utility in this counterfactual planning world, and then performs the same action in the real world. We use counterfactual planning to construct an AGI agent emergency stop button, and a safety interlock that will automatically stop the agent before it undergoes an intelligence explosion. We also construct an agent with an input terminal that can be used by humans to iteratively improve the agent's reward function, where the incentive for the agent to manipulate this improvement process is suppressed. As an example of counterfactual planning in a non-agent AGI system, we construct a counterfactual oracle. As a design approach, counterfactual planning is built around the use of a graphical notation for defining mathematical counterfactuals. This two-diagram notation also provides a compact and readable language for reasoning about the complex types of self-referencing and indirect representation which are typically present inside machine learning agents.
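The core loop described above — plan in a counterfactual world model, then act in the real world — can be sketched in a few lines. This is a toy illustration, not the paper's construction: the class and function names are invented, the "world model" is a hand-written lookup table rather than a learned model, and expected utility is estimated by naive Monte Carlo rollouts. The safety intuition is visible even in this toy: because the counterfactual planning world omits the stop-button dynamics, tampering with the stop button earns no extra utility there, so the planner never selects it.

```python
import random


class PlanningWorld:
    """Toy stand-in for a counterfactual world model.

    In the paper's setup this model is constructed by a machine learning
    system and deliberately differs from the real world (e.g. it models a
    world in which the emergency stop button is never pressed). Here it is
    just a fixed action -> utility table plus Gaussian noise.
    """

    def __init__(self, utilities, noise=0.0, seed=0):
        self.utilities = utilities          # dict: action -> mean utility
        self.noise = noise                  # stddev of simulation noise
        self.rng = random.Random(seed)      # seeded for reproducibility

    def simulate(self, action):
        """Return one sampled utility for taking `action` in this world."""
        return self.utilities[action] + self.rng.gauss(0.0, self.noise)


def counterfactual_plan(planning_world, actions, rollouts=100):
    """Pick the action with highest estimated expected utility in the
    counterfactual planning world; the agent then performs this same
    action in the real world."""
    def expected_utility(action):
        total = sum(planning_world.simulate(action) for _ in range(rollouts))
        return total / rollouts
    return max(actions, key=expected_utility)


# In the counterfactual world the stop button is never pressed, so
# disabling it yields no advantage over simply doing useful work.
world = PlanningWorld(
    {"wait": 0.0, "work": 1.0, "disable_stop_button": 0.2},
    noise=0.1,
)
print(counterfactual_plan(world, ["wait", "work", "disable_stop_button"]))
```

Running the sketch selects `"work"`: the action chosen inside the counterfactual planning world is the one executed in the real world, which is the separation the paper's safety mechanisms exploit.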


