Self-play in RL is evidence enough that machines can learn on their own; how we train models, and what class of models we train, matters. No doubt the paper makes good points, but I don't think the reality is so black-and-white.
The difference between self-play in a game such as Go and training an LLM on its own output (or a previous LLM's output) seems fairly obvious.
In self-play, an objective measure of the "truth" of a move can ultimately be computed from the rules of the game, which any machine can evaluate.
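To make that concrete, here is a minimal sketch (my own illustration, not from the thread): two copies of the same policy play tic-tac-toe against each other, and the reward for each episode is derived purely from the rules of the game, with no human labels or outside data involved. The policy here is just random play; the point is where the training signal comes from.

```python
import random

# All eight winning lines on a 3x3 board (indices 0-8).
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return +1 or -1 if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return None

def self_play_episode(policy, rng):
    """Play one game with both sides using the same policy.

    Returns the outcome (+1, -1, or 0 for a draw) -- the objective
    reward that the rules of the game provide for free.
    """
    board = [0] * 9
    player = 1
    while True:
        moves = [i for i, v in enumerate(board) if v == 0]
        if not moves:
            return 0  # board full, draw
        board[policy(board, moves, rng)] = player
        w = winner(board)
        if w is not None:
            return w
        player = -player

def random_policy(board, moves, rng):
    # Placeholder policy; a real self-play setup would improve this
    # from the episode outcomes.
    return rng.choice(moves)

rng = random.Random(0)
outcomes = [self_play_episode(random_policy, rng) for _ in range(100)]
print(sorted(set(outcomes)))
```

Every episode ends with a ground-truth reward computed by `winner`, which is exactly the kind of signal an LLM trained on its own text does not have.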
With an LLM, the machine is only aping, emulating, simulating the output that people have produced about the world; the machine has no access to the actual "real world" that people are using language to describe. Human beings talking about the world is data that increases your own knowledge of the world. Your own predictions of that talking, not so much.
The success of the newest GPT models relies on RL to refine the latent space inside the LLM. There's a bottleneck when using humans to refine that space. The next model or subsequent models will surely use RL techniques like self-play to break through that bottleneck.
Human-based RL is used because humans know things about the real world and can rank language outputs accordingly. There's no "self-play" process that gives a system that sort of knowledge.