Parameterized Synthetic Text Generation with SimpleStories

arXiv:2504.09184

Authors

Thomas Dooms, Dan Braun, Chandan Sreedhara, Lennart Finke, Mat Allen

Abstract

We present SimpleStories, a large synthetic story dataset in simple language, consisting of 2 million samples each in English and Japanese. Through parameterizing prompts at multiple levels of abstraction, we achieve control over story characteristics at scale, inducing syntactic and semantic diversity.

Ablations on a newly trained model suite show improved sample efficiency and model interpretability compared to the TinyStories dataset. We open-source all constituent parts of model creation, hoping to enable novel ways to study the end-to-end training process.

As a byproduct, we move the frontier regarding the fewest-parameter language model that outputs grammatical natural language.
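
To make the idea of parameterized prompts concrete, the following is a minimal sketch of how story prompts might be assembled from independently sampled parameters. The parameter names (`theme`, `setting`, `style`, `narrative_feature`), their values, and the prompt wording are all hypothetical illustrations; the abstract does not specify the actual parameters or templates used for SimpleStories.

```python
import random

# Hypothetical parameter axes. The real dataset's axes and values are not
# given in the abstract; these are placeholders for illustration only.
PARAMETERS = {
    "theme": ["friendship", "courage", "curiosity"],
    "setting": ["a forest", "a small village", "a space station"],
    "style": ["playful", "calm", "mysterious"],
    "narrative_feature": ["a dialogue", "a twist ending", "a moral lesson"],
}


def sample_parameters(rng):
    """Draw one value per axis, independently and uniformly at random."""
    return {name: rng.choice(values) for name, values in PARAMETERS.items()}


def build_prompt(params):
    """Fill a fixed (assumed) template with the sampled parameter values."""
    return (
        f"Write a short story in simple language about {params['theme']}, "
        f"set in {params['setting']}. Use a {params['style']} tone and "
        f"include {params['narrative_feature']}."
    )


def generate_prompts(n, seed=0):
    """Produce n prompts reproducibly from a seeded random generator."""
    rng = random.Random(seed)
    return [build_prompt(sample_parameters(rng)) for _ in range(n)]


if __name__ == "__main__":
    for prompt in generate_prompts(3):
        print(prompt)
```

Sampling each axis independently means the number of distinct prompts grows multiplicatively with the number of values per axis (here 3 × 3 × 3 × 3 = 81), which is one plausible way such a scheme could induce diversity at scale.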
