Self-correcting LLM-controlled Diffusion Models

CVPR 2024

Tsung-Han Wu, Long Lian, Joseph E. Gonzalez, Boyi Li, Trevor Darrell

UC Berkeley

TL;DR: The Self-correcting LLM-controlled Diffusion (SLD) Framework features:
  1. Self-correction: Enhances generative models with LLM-integrated detectors for precise text-to-image alignment.
  2. Unified Generation and Editing: Excels at both image generation and fine-grained editing.
  3. Universal Compatibility: Works with ANY image generator, like DALL-E 3, requiring no extra training or data.

Self-correcting LLM-controlled Diffusion Models (SLD)

SLD enhances text-to-image alignment through an iterative self-correction process: it begins with LLM-driven object detection, then performs LLM-controlled analysis and applies training-free corrections.
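The loop below is a minimal sketch of this process. The helper names (`llm.parse_key_phrases`, `detector.detect`, `llm.suggest_corrections`, `apply_latent_ops`) are hypothetical stand-ins for the framework's components, not the released API.

```python
def self_correcting_generation(prompt, generate, llm, detector,
                               apply_latent_ops, max_rounds=3):
    """Iteratively detect prompt/image mismatches and correct them."""
    image = generate(prompt)  # initial image from any T2I generator
    for _ in range(max_rounds):
        # 1. LLM-driven detection: extract key object phrases from the
        #    prompt, then ground them with an open-vocabulary detector.
        phrases = llm.parse_key_phrases(prompt)   # e.g. ["blue bicycle", "palm tree"]
        boxes = detector.detect(image, phrases)   # detected bounding boxes

        # 2. LLM-controlled analysis: compare the detections against the
        #    prompt and propose object-level edits
        #    (add / delete / move / re-attribute).
        edits = llm.suggest_corrections(prompt, boxes)
        if not edits:  # image already aligns with the prompt; stop early
            break

        # 3. Training-free correction: realize the edits with latent-space
        #    operations followed by a short re-diffusion pass.
        image = apply_latent_ops(image, edits)
    return image
```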

SLD Unifies Image Generation and Editing!

Visualizations: Correcting Various Generative Models

SLD improves text-to-image alignment in models like SDXL, LMD+, and DALL-E 3. The first row shows SLD precisely placing a blue bicycle between a bench and a palm tree, with accurate counts of palm trees and seagulls. The second row demonstrates SLD's efficacy in complex scenes, preventing object collisions with training-free latent operations.
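One such latent operation can be illustrated with a toy repositioning step: lift an object's latent patch, refill the source region with noise so the object is not duplicated, and paste the patch at the target box before re-running a few denoising steps. This is a simplified sketch under an equal-box-size assumption; `move_object_latent` is our illustrative name, not the paper's exact procedure.

```python
import torch

def move_object_latent(latent, src_box, dst_box):
    """Reposition an object by transplanting its latent patch.

    `latent` is a (B, C, H, W) diffusion latent; boxes are pixel-space
    (x0, y0, x1, y1) tuples of equal size in this simplified sketch.
    """
    x0, y0, x1, y1 = src_box
    patch = latent[..., y0:y1, x0:x1].clone()
    # Refill the source region with noise so the object does not
    # appear twice after the subsequent re-diffusion pass.
    latent[..., y0:y1, x0:x1] = torch.randn_like(patch)
    # Paste the patch at the destination box.
    X0, Y0, X1, Y1 = dst_box
    latent[..., Y0:Y1, X0:X1] = patch
    return latent
```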

Visualizations: Complex Object-level Editing

SLD can handle a diverse array of image editing tasks guided by natural, human-like instructions. Its capabilities span from adjusting object counts to altering object attributes, positions, and sizes.
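Illustratively, a free-form instruction might be mapped to structured, object-level operations like the ones below. The operation schema here is a hypothetical example for exposition, not the paper's exact format.

```python
instruction = "Replace one seagull with a kite and make the bicycle red"

# Hypothetical structured edits an LLM controller might emit
# (boxes are normalized [x0, y0, x1, y1] coordinates).
edit_ops = [
    {"op": "delete",    "object": "seagull #2", "box": [0.61, 0.10, 0.74, 0.22]},
    {"op": "add",       "object": "kite",       "box": [0.61, 0.08, 0.76, 0.24]},
    {"op": "attribute", "object": "bicycle",    "change": "color -> red"},
]
```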

Citation

If you use this work or find it helpful, please consider citing:

@article{wu2023self,
  title={Self-correcting LLM-controlled Diffusion Models},
  author={Wu, Tsung-Han and Lian, Long and Gonzalez, Joseph E and Li, Boyi and Darrell, Trevor},
  journal={arXiv preprint arXiv:2311.16090},
  year={2023}
}

@article{lian2023llmgrounded,
  title={LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models},
  author={Lian, Long and Li, Boyi and Yala, Adam and Darrell, Trevor},
  journal={arXiv preprint arXiv:2305.13655},
  year={2023}
}

Credit: The design of this project page references the project pages of LVD, LMD, NeRF, DeepMotionEditing, and LERF.