Recent years have witnessed a growing demand for visual content, such as 2D images and multi-frame videos, in fields like computational photography, virtual reality, gaming, and the film industry. In response, various generative models, including VQVAE, GAN, and diffusion models, have been proposed to generate visual content from noise or text. However, adapting these models to the more practical setting of image-to-image generation, also known as image processing and editing, remains an open challenge. This thesis explores the paradigm of image editing with generative models, with a focus on foundation models obtained from large-scale pretraining. We begin by exploring real-time image rescaling. Images from modern cameras can reach 6K resolution, but they al...