DS1 spectrogram: Zero-Shot Coreset Selection via Iterative Subspace Sampling

Zero-Shot Coreset Selection via Iterative Subspace Sampling

2411.15349

Authors

Brent A. Griffin,Jacob Marks,Jason J. Corso

Abstract

Deep learning increasingly relies on massive data with substantial storage, annotation, and training costs. To reduce costs, coreset selection finds a representative subset of data to train models while ideally performing on par with the full data training.

To maximize performance, current state-of-the-art coreset methods select data using dataset-specific ground truth labels and training. However, these methodological requirements prevent selection at scale on real-world, unlabeled data.

To that end, this paper addresses the selection of coresets that achieve state-of-the-art performance but without using any labels or training on candidate data. Instead, our solution, Zero-Shot Coreset Selection via Iterative Subspace Sampling (ZCore), uses previously-trained foundation models to generate zero-shot, high-dimensional embedding spaces to interpret unlabeled data.

ZCore then iteratively quantifies the relative value of all candidate data based on coverage and redundancy in numerous subspace distributions. Finally, ZCore selects a coreset sized for any data budget to train downstream models.

We evaluate ZCore on four datasets and outperform several state-of-the-art label-based methods, especially at low data rates that provide the most substantial cost reduction. On ImageNet, ZCore selections for 10% training data achieve a downstream validation accuracy of 53.99%, which outperforms prior label-based methods and removes annotation and training costs for 1.15 million images.

Our paper's code is publicly available at https://github.com/voxel51/zcore.

Resources

Stay in the loop

Every AI paper that matters, free in your inbox daily.

Details

  • © 2026 takara.ai Ltd
  • Content is sourced from third-party publications.