
Diff-Instruct*: Towards Human-Preferred One-step Text-to-image Generative Models
Authors
Abstract
We propose Diff-Instruct* (DI*), a data-efficient post-training approach for one-step text-to-image generative models that improves their alignment with human preferences without requiring image data. Our method frames alignment as online reinforcement learning from human feedback (RLHF), optimizing the one-step model to maximize human reward functions while being regularized to stay close to a reference diffusion process.
Unlike traditional RLHF approaches, which rely on the Kullback-Leibler divergence as the regularization, we introduce a novel general score-based divergence regularization that substantially improves both performance and post-training stability. Although the general score-based RLHF objective is intractable to optimize, we derive a theoretically equivalent yet tractable loss function whose gradient can be computed efficiently. We introduce DI-SDXL-1step*, a 2.6B one-step text-to-image model at a resolution of $1024\times 1024$, post-trained from DMD2 w.r.t. SDXL. Our 2.6B DI-SDXL-1step* model outperforms the 50-step 12B FLUX-dev model in ImageReward, PickScore, and CLIP score on the Parti prompts benchmark while using only 1.88% of the inference time. This result clearly shows that, with proper post-training, small one-step models are capable of outperforming much larger multi-step diffusion models.
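As a schematic sketch of the regularized RLHF objective described above (the symbols $p_\theta$ for the one-step generator's distribution, $r$ for the human reward, $p_{\mathrm{ref}}$ for the reference diffusion process, and $\alpha$ for the regularization weight are illustrative notation, not necessarily the paper's exact symbols), the alignment problem can be written as
$$\max_{\theta}\;\; \mathbb{E}_{x \sim p_\theta}\big[r(x)\big] \;-\; \alpha\, \mathcal{D}\big(p_\theta \,\|\, p_{\mathrm{ref}}\big),$$
where $\mathcal{D}$ is the Kullback-Leibler divergence in traditional RLHF and a general score-based divergence in DI*.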
Our model is open-sourced at https://github.com/pkulwj1994/diff_instruct_star. We hope our findings can contribute to the development of human-centric machine learning techniques.