Look Ma, No RLHF!

VC Ramesh
May 24, 2023
1 min read

Superficial Alignment Hypothesis: A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users. If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples.

https://lnkd.in/ge3aNmeR

Look Ma, No RLHF!

Recent Posts

Comments