Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty

arXiv:2308.02019

Authors

Inar Timiryasov, Jean-Loup Tastet

Abstract

We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of GPT-2 and small LLaMA models on the developmentally plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which outperforms both of its teachers as well as a similar model trained without distillation.

This suggests that, when the teacher model is trained on a sufficiently small dataset, distillation can not only retain the teacher's full performance but exceed it, and can lead to significantly better performance than direct training.
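The abstract does not spell out the training objective, so the following is only a minimal sketch of how distillation from an ensemble of teachers is commonly set up, assuming a standard soft-label formulation: the student is trained on a weighted sum of the usual cross-entropy against the true next token and a KL divergence toward the averaged teacher distribution. The function names, the weight `alpha`, and the `temperature` are hypothetical choices for illustration, not values taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of a list of logits at a given temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_distillation_loss(student_logits, teacher_logits_list, target,
                               alpha=0.5, temperature=2.0):
    """Distillation loss for a single token position.

    alpha and temperature are illustrative assumptions, not the paper's values.
    """
    # Hard-label term: cross-entropy of the student against the true next token.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[target])

    # Soft-label term: average the teachers' tempered distributions ...
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    n_teachers = len(teacher_probs)
    avg_teacher = [
        sum(p[i] for p in teacher_probs) / n_teachers
        for i in range(len(student_logits))
    ]

    # ... then take KL(avg_teacher || student) at the same temperature.
    student_soft = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(avg_teacher, student_soft)
        if p > 0
    )

    # The temperature**2 factor keeps the soft term's gradient scale
    # comparable across different temperatures.
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


# Toy example: two teachers, a vocabulary of three tokens, true token index 0.
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
loss = ensemble_distillation_loss([1.0, 0.2, -0.3], teachers, target=0)
```

Averaging the teachers' probabilities rather than their logits is one of several reasonable ways to combine an ensemble; in an actual training run the same computation would be applied over batches of token positions with a tensor library.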
