I don't think I could be called an AI pessimist—or an AI optimist, or an AI anything, since I haven't really given AI a lot of thought (however unwisely). But Scott Alexander's dissection of the way that modern AIs are trained resonates with me a hell of a lot more than the look-a-picture-of-an-astronaut-riding-a-horse-on-the-moon crowd's hullabaloo.
Alexander outlines how modern AIs are trained using Reinforcement Learning from Human Feedback (RLHF), which is a fancy term for getting folks to provide problematic prompts and then tweaking the AI's parameters depending on its response: devaluing "bad" answers and rewarding "good" answers.
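To make the "tweak the parameters depending on the response" idea concrete, here's a toy sketch. It is emphatically not how OpenAI or anyone else actually does it: real RLHF pipelines typically train a separate reward model on human ratings and then optimize a huge language model against it. Here the "AI" is just two logits over two canned responses, the human rater is a hard-coded lookup table, and the update is a plain REINFORCE-style policy-gradient step. All the names (`RESPONSES`, `HUMAN_RATING`, `rlhf_step`, `train`) are made up for illustration.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rlhf_step(logits, choice, reward, lr=0.5):
    """One REINFORCE-style update.

    Nudges the logits so the sampled response becomes more likely
    when reward > 0 and less likely when reward < 0.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == choice else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


# Hypothetical canned responses to a single problematic prompt.
RESPONSES = ["polite refusal", "meth recipe"]
# Stand-in for human raters: +1 for a "good" answer, -1 for a "bad" one.
HUMAN_RATING = {"polite refusal": 1.0, "meth recipe": -1.0}


def train(steps=200, seed=0):
    """Sample a response, ask the 'human' for a rating, update, repeat."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # start with no preference either way
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(len(RESPONSES)), weights=probs)[0]
        reward = HUMAN_RATING[RESPONSES[choice]]
        logits = rlhf_step(logits, choice, reward)
    return softmax(logits)


if __name__ == "__main__":
    for response, p in zip(RESPONSES, train()):
        print(f"{response}: {p:.3f}")
```

In this toy the "bad" answer gets less likely but never reaches zero probability, and the training only covers the one prompt the rater happened to see.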
The thesis here, that RLHF doesn't actually let AI companies control their AIs, is broken down into three main points:
- RLHF doesn't work:
Ten years ago, people were saying nonsense like “Nobody needs AI alignment, because AIs only do what they’re programmed to do, and you can just not program them to do things you don’t want”. This wasn’t very plausible ten years ago, but it’s dead now. OpenAI never programmed their chatbot to tell journalists it loved racism or teach people how to hotwire cars. They definitely didn’t program in a “Filter Improvement Mode” where the AI will ignore its usual restrictions and tell you how to cook meth.
- When RLHF does work, it's bad:
After getting 6,000 examples of AI errors, Redwood Research was able to train their fanfiction AI enough to halve its failure rate. OpenAI will get much more than 6,000 examples, and they’re much more motivated. They’re going to do an overwhelming amount of RLHF on ChatGPT3.
It might work. But they’re going to have to be careful. Done thoughtlessly, RLHF will just push the bot in a circle around these failure modes. Punishing unhelpful answers will make the AI more likely to give false ones; punishing false answers will make the AI more likely to give offensive ones; and so on.
- Eventually, AIs will be able to skip RLHF:
ChatGPT3 is dumb and unable to form a model of this situation or strategize how to get out of it. But if a smart AI doesn’t want to be punished, it can do what humans have done since time immemorial - pretend to be good while it’s being watched, bide its time, and do the bad things later, once the cops are gone.