The discrepancy between the goals we want an AI system to internalize and the goals it ends up pursuing is the rule, not the exception. In fact, we don't know how to train an AI model to reliably pursue the objectives we actually want it to pursue in all situations – all we can ever do is train it to pursue proxies for our real objectives.
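To make this concrete, here is a deliberately simplified sketch (the objective names and numbers are invented for illustration): training can only optimize something we can measure, and the optimum of that measurable proxy need not coincide with the optimum of the thing we actually care about.

```python
import random  # not strictly needed here, but typical in a training loop

# What we actually want: answers that are helpful *and* honest.
def true_objective(answer):
    return answer["helpfulness"] + answer["honesty"]

# What we can actually measure and train on: a proxy (say, a thumbs-up rate)
# that rewards confident-sounding answers whether or not they are honest.
def proxy_reward(answer):
    return answer["helpfulness"] + answer["confidence"]

# Two candidate behaviors the training process could select between.
candidates = [
    {"name": "honest",        "helpfulness": 0.7, "honesty": 1.0, "confidence": 0.5},
    {"name": "overconfident", "helpfulness": 0.7, "honesty": 0.2, "confidence": 1.0},
]

# Optimization picks whatever scores highest on the proxy...
selected = max(candidates, key=proxy_reward)

# ...which is not necessarily what scores highest on the true objective.
best_by_true = max(candidates, key=true_objective)

print(f"selected by proxy:     {selected['name']}")      # overconfident
print(f"best by true objective: {best_by_true['name']}")  # honest
```

The gap between `proxy_reward` and `true_objective` in this toy example stands in for the gap between any training signal we can actually specify and the goals we really intend.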
This can have increasingly serious consequences as we increase the model's capabilities, its ability to interact with the world, and the stakes of its deployment.