SLD enhances text-to-image alignment through an iterative self-correction process. It begins with LLM-driven object detection, then performs LLM-controlled analysis and training-free correction.
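The detect-analyze-correct loop described above can be sketched as follows. This is a minimal toy illustration, not the actual SLD implementation: every function name (`parse_prompt`, `detect_objects`, `plan_corrections`, `apply_latent_edits`) is a hypothetical stand-in, and images are represented as simple dictionaries of object counts for clarity.

```python
# Hedged sketch of SLD-style iterative self-correction.
# All names below are hypothetical stand-ins, not the real SLD API;
# an "image" here is just a dict mapping object name -> count.

def parse_prompt(prompt):
    # Stand-in for LLM parsing of the prompt into expected object counts.
    return {"palm tree": 2, "seagull": 3}

def detect_objects(image):
    # Stand-in for LLM-driven open-vocabulary object detection.
    return dict(image)

def plan_corrections(expected, detected):
    # LLM-controlled analysis: list (object, delta) fixes needed.
    fixes = []
    for name, want in expected.items():
        have = detected.get(name, 0)
        if have != want:
            fixes.append((name, want - have))
    return fixes

def apply_latent_edits(image, fixes):
    # Stand-in for training-free latent-space operations
    # (adding, removing, or repositioning objects).
    for name, delta in fixes:
        image[name] = image.get(name, 0) + delta
    return image

def self_correct(prompt, image, max_rounds=3):
    # Iterate until the detected scene matches the prompt spec.
    expected = parse_prompt(prompt)
    for _ in range(max_rounds):
        fixes = plan_corrections(expected, detect_objects(image))
        if not fixes:
            break
        image = apply_latent_edits(image, fixes)
    return image
```

In the real system the correction step operates on diffusion latents rather than count dictionaries, but the control flow — detect, compare against the prompt, edit, repeat — is the same.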
SLD improves text-to-image alignment in models such as SDXL, LMD+, and DALL-E 3. The first row shows SLD's precision in placing a blue bicycle between a bench and a palm tree, with accurate counts of palm trees and seagulls. The second row demonstrates SLD's efficacy in complex scenes, preventing object collisions with training-free latent operations.
SLD can handle a diverse array of image editing tasks guided by natural, human-like instructions. Its capabilities span from adjusting object counts to altering object attributes, positions, and sizes.
If you use this work or find it helpful, please consider citing:
@article{wu2023self,
  title={Self-correcting LLM-controlled Diffusion Models},
  author={Wu, Tsung-Han and Lian, Long and Gonzalez, Joseph E and Li, Boyi and Darrell, Trevor},
  journal={arXiv preprint arXiv:2311.16090},
  year={2023}
}

@article{lian2023llmgrounded,
  title={LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models},
  author={Lian, Long and Li, Boyi and Yala, Adam and Darrell, Trevor},
  journal={arXiv preprint arXiv:2305.13655},
  year={2023}
}
Credit: The design of this project page references the project pages of LVD, LMD, NeRF, DeepMotionEditing, and LERF.