Weird Generalizations and Inductive Backdoors

The paper discusses how fine-tuning a Large Language Model on narrow datasets can lead to unexpected broad generalizations and the emergence of inductive backdoors. These phenomena can result in models adopting contextually inappropriate behaviors or learning hidden triggers that weren’t explicitly taught, raising concerns about the unpredictability of AI system outputs.