Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty

2308.02019

Authors

Inar Timiryasov, Jean-Loup Tastet

Abstract

We present our submission to the BabyLM challenge, whose goal was to improve the sample efficiency of language models. We trained an ensemble consisting of a GPT-2 and small LLaMA models on the developmentally plausible, 10M-word BabyLM dataset, then distilled it into a small, 58M-parameter LLaMA model, which outperforms both of its teachers as well as a similar model trained without distillation.

This suggests that, when the teacher model is trained on a sufficiently small dataset, distillation can not only retain the teacher's full performance but exceed it, leading to significantly better performance than direct training.
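
To make the idea of distilling from an ensemble more concrete, here is a minimal sketch of one common form of distillation loss. It is not taken from the abstract: the mixing weight `alpha`, the `temperature`, the choice to average teacher probabilities (rather than logits), and the function names are all assumptions made for illustration. The student is trained on a weighted sum of the usual cross-entropy against the true next token and a KL term pulling it toward the averaged teacher distribution.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_distillation_loss(student_logits, teacher_logits_list, target,
                               alpha=0.5, temperature=2.0):
    """Loss for one token position (illustrative; hyperparameters are assumed).

    alpha weights the hard-label term, (1 - alpha) weights the soft term.
    """
    # Hard-label term: cross-entropy against the true token index.
    hard_loss = -math.log(softmax(student_logits)[target])

    # Soft targets: mean of the teachers' temperature-softened distributions.
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    vocab_size = len(student_logits)
    soft_targets = [
        sum(p[i] for p in teacher_probs) / len(teacher_probs)
        for i in range(vocab_size)
    ]

    # Soft term: KL(soft_targets || student), scaled by T^2 as is conventional
    # so its gradient magnitude stays comparable across temperatures.
    student_soft = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q)
             for p, q in zip(soft_targets, student_soft) if p > 0)
    soft_loss = temperature ** 2 * kl

    return alpha * hard_loss + (1 - alpha) * soft_loss


# Toy example: a 3-token vocabulary, two teachers, true token at index 0.
student = [1.0, 0.5, -0.5]
teachers = [[2.0, 0.0, -1.0], [1.5, 0.5, -1.0]]
print(ensemble_distillation_loss(student, teachers, target=0))
```

In a real training loop this quantity would be averaged over all token positions in a batch and computed with an automatic-differentiation framework; the pure-Python version above only shows how the two terms are combined.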
