Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Large Language Models (LLMs) are pre-trained on large data from different sources and domains. These datasets often contain trillions of tokens, including large portions of copyrighted or proprietary content, which raises questions about the legal use of such models.

This underscores the need for truly open pre-training data that complies with data security regulations. In this paper, we introduce Common Corpus, the largest open dataset for LLM pre-training.

The data assembled in Common Corpus are either uncopyrighted or under open licenses and amount to about two trillion tokens. The dataset contains a wide variety of languages, ranging from the high-resource European languages to some low-resource languages rarely represented in pre-training datasets.

In addition, it includes a large amount of code data. The diversity of data sources in terms of covered domains and time periods opens up the paths for both research and entrepreneurial needs in diverse areas of knowledge.

In this paper, we present the detailed provenance of data assembling and the details of dataset filtering and curation. We train two small language models on Common Corpus and find that they perform comparably to other models of their size, indicating that our dataset is suitable for multilingual pretraining.

Common Corpus represents a key contribution to the ecosystem for open science research on Large Language Models.

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

Authors

Abstract

Resources

Stay in the loop

Pages

Tools

Details