Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting
Su Wang · Chitwan Saharia · Ceslee Montgomery · Jordi Pont-Tuset · Shai Noy · Stefano Pellegrini · Yasumasa Onoe · Sarah Laszlo · David J. Fleet · Radu Soricut · Jason Baldridge · Mohammad Norouzi · Peter Anderson · William Chan
West Building Exhibit Halls ABC 180
Text-guided image editing can have a transformative impact in supporting creative applications. A key challenge is to generate edits that are faithful to the input text prompt while remaining consistent with the input image. We present Imagen Editor, a cascaded diffusion model built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor’s edits are faithful to the text prompts, which is accomplished by using object detectors to propose inpainting masks during training. In addition, Imagen Editor captures fine details in the input image by conditioning the cascaded pipeline on the original high-resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images, exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object masking during training leads to across-the-board improvements in text-image alignment, such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion. As a cohort, these models are better at object rendering than text rendering, and they handle material, color, and size attributes better than count and shape attributes.
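The idea of using object detections to propose inpainting masks during training can be illustrated with a minimal sketch. It assumes the detector returns axis-aligned boxes as (x0, y0, x1, y1) pixel coordinates; the helper name `box_to_inpainting_mask`, the choice of sampling one box at random, and the random-rectangle fallback are hypothetical and are not details given in the abstract.

```python
import numpy as np


def box_to_inpainting_mask(height, width, boxes, rng):
    """Return a binary mask (1 = region to inpaint) for one training image.

    Hypothetical helper for illustration only.
    boxes: list of (x0, y0, x1, y1) detector boxes, with x1 and y1 exclusive.
    rng:   a numpy.random.Generator used for sampling.
    """
    mask = np.zeros((height, width), dtype=np.uint8)
    if boxes:
        # Assumption: mask a single detected object chosen uniformly at random.
        x0, y0, x1, y1 = boxes[int(rng.integers(len(boxes)))]
    else:
        # Assumption: with no detections, fall back to a random rectangle.
        x0 = int(rng.integers(0, width))
        y0 = int(rng.integers(0, height))
        x1 = int(rng.integers(x0 + 1, width + 1))
        y1 = int(rng.integers(y0 + 1, height + 1))
    # Clip the box to the image bounds before filling it in.
    x0, x1 = max(0, x0), min(width, x1)
    y0, y1 = max(0, y0), min(height, y1)
    mask[y0:y1, x0:x1] = 1
    return mask


rng = np.random.default_rng(0)
mask = box_to_inpainting_mask(8, 10, [(2, 3, 6, 7)], rng)
```

Because the masked region covers a whole object, the model may have to rely more on the text prompt to fill it in than it would with an arbitrary random region.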