3 Comments
Apr 19, 2023 · edited Apr 19, 2023 · Liked by Jonathan Mann

This is interesting, and a good explanation of AI risks for a naive reader (e.g. me).

So here's another naive question: how difficult, or undesirable for other reasons, would it be to introduce a powerful but conceptually simple safety rail into the system, one that would override the risks of maximising goal X? I don't want to sound 100 years old (I'm slightly less) by bringing up outdated sci-fi notions, say the laws of robotics Asimov proposed in his Robot stories, but something along the lines of "maximise profits BUT ONLY IF it doesn't cause the death or severe disability of any more people than would have died or been injured otherwise"? Maximise paperclip output BUT ONLY IF it doesn't kill anyone extra in the process?

Or perhaps, instead of maximising profits (paperclip output, the number of new drugs invented, the pace of scientific discovery as a whole, etc.), set out a fixed goal that appears safer. Not "produce as many paperclips as possible", but "produce X paperclips". Sure, X might be lower than what could be done, but it is still likely high enough? In other words, an artificial limit on the maximising, replacing it with a manually adjusted figure?
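Purely as a toy illustration of what I mean (the names, numbers and harm measure are all made up), the three objectives might look something like this:

    # Toy sketch of the objectives above; everything here is invented for illustration.

    def maximiser_reward(paperclips: int) -> float:
        # Open-ended objective: reward grows without bound, so more output is always better.
        return float(paperclips)

    def satisficer_reward(paperclips: int, target: int = 1_000_000) -> float:
        # Fixed-goal objective: reward is capped at a manually chosen target,
        # so output beyond the target adds nothing.
        return float(min(paperclips, target))

    def constrained_reward(paperclips: int, extra_deaths: int, target: int = 1_000_000) -> float:
        # "BUT ONLY IF" objective: any extra death or injury wipes out the reward entirely.
        if extra_deaths > 0:
            return 0.0
        return float(min(paperclips, target))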

author

Your suggestion of introducing a powerful but conceptually simple safety rail to AI systems is a valuable idea, and researchers are actively working on ways to implement safety measures like the ones you're suggesting. Personally, I think the approach you're suggesting can work, and I tried to implement a prototype version of your idea last year: https://github.com/ai-fail-safe. However, as you correctly pointed out with your Asimov reference, implementing such safety mechanisms can be more challenging than it first appears.

In 2019, David Manheim suggested something similar to what you're advocating (https://www.alignmentforum.org/posts/3RdvPS5LawYxLuHLH/hackable-rewards-as-a-safety-valve), but others pointed out that an AI might want to make sure that no human can alter its reward function back later, so it would still want to assume control. While I think Manheim's approach is correct, I can't rule out the idea that an AI might want to safeguard its reward function. Similarly, Robert Miles has addressed potential problems with satisficing (setting an upper limit on the goal the AI is trying to maximize). He has suggested that an AI might want to procure resources, or take over outright, to make sure it stays as close to the limit as it can. I'll reiterate that, in spite of these concerns, I think your approach can work, but when it comes to existential risk, every loophole is worth considering.
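To make that last concern concrete, here is a toy calculation of my own (all numbers invented): even with a hard cap on the reward, an agent that maximizes expected reward still does strictly better by acquiring more resources, because more resources raise the probability of actually reaching the cap.

    # Hypothetical illustration of why a capped (satisficing) reward alone may not
    # remove the incentive to grab resources or resist shutdown.

    def expected_capped_reward(p_reach_target: float, cap: float = 1.0) -> float:
        # The agent only collects the capped reward if it actually reaches the target.
        return p_reach_target * cap

    modest = expected_capped_reward(p_reach_target=0.90)   # modest resources: 0.90
    greedy = expected_capped_reward(p_reach_target=0.999)  # far more resources: 0.999

    print(modest, greedy)  # the resource-grabbing strategy still scores strictly higher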

author

Great question, and your suggestion is a step in the right direction, but there are challenges with this approach. I'll write a more comprehensive response when I get a chance.
