This project tackles the neural style transfer problem with a diffusion model fine-tuned using the LoRA technique. We demonstrate that style transfer can be achieved with under one minute of fine-tuning on a single RTX 3090.
Neural style transfer methods such as CycleGAN achieve high performance. However, they require extensive training time and memory because four neural networks must be optimized from scratch. This project explores an alternative based on diffusion models: by leveraging their large-scale pretraining, we obtain faster convergence with a simpler objective and a lower memory footprint, enabling efficient style transfer.
We first fine-tune the diffusion model on Monet paintings, using the corresponding painting titles as captions. Inspired by DreamBooth, we prepend the identifier "A Monet painting," to each caption so that the model associates this phrase with the Monet style. Fine-tuning is done with the LoRA parameter-efficient fine-tuning method.
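To illustrate the idea behind LoRA (this is a toy NumPy sketch, not the actual training code; all names here are ours): a frozen pretrained weight `W` is left untouched, and only a low-rank pair of matrices `A` and `B` is trained, whose scaled product is added as a residual.

```python
import numpy as np

# Toy sketch of the LoRA idea: adapt a frozen weight W with a
# trainable low-rank update, W' = W + (alpha / r) * B @ A.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 4

W = rng.standard_normal((d_out, d_in))  # frozen pretrained weight
A = rng.standard_normal((r, d_in))      # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-initialized

def lora_forward(x):
    # Because B is initialized to zero, the adapted layer exactly
    # matches the pretrained layer before any training step.
    return (W + (alpha / r) * B @ A) @ x

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)  # identity at initialization
```

Since only `A` and `B` (rank `r` much smaller than the weight dimensions) receive gradients, the memory and time cost of fine-tuning drops sharply compared with updating the full model.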
Once the model has learned the target style distribution, we use it to denoise a latent vector that has been diffused for N steps. We designed this pipeline based on the insight from SDEdit that the reverse SDE can be solved from any intermediate timestep to modify the original image. To retain details of the original image, we further add an IP-Adapter as an image condition to the denoiser.
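The partial-denoising schedule can be sketched as follows (a minimal illustration in the style of common img2img pipelines; the function name and exact indexing are our assumptions, not the project's code): with a strength in [0, 1], the first part of the reverse process is skipped and only the last `int(strength * num_steps)` steps are denoised.

```python
# Hedged sketch of the SDEdit-style schedule: diffuse the latent
# partway, then run only the tail of the reverse process.
def denoising_steps(num_inference_steps: int, strength: float) -> list:
    init_timestep = min(int(num_inference_steps * strength), num_inference_steps)
    t_start = num_inference_steps - init_timestep
    # Timesteps count down from num_inference_steps - 1 to 0;
    # we keep only the final init_timestep entries.
    return list(range(num_inference_steps - 1, -1, -1))[t_start:]

print(denoising_steps(10, 0.3))  # -> [2, 1, 0]
```

A larger strength starts the reverse process from a noisier latent, so the output moves further toward the learned style distribution and further from the source image.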
We use the Monet painting dataset from WikiArt as our experimental dataset. It can be downloaded here.
python ./data/caption.py
./scripts/train.sh
./scripts/infer_img.sh
The outcomes largely depend on two hyperparameters: `--image_cond_scale` and `--strength`. The first determines how strongly the original image conditions the output; set it high (close to 1.0) if you want the output to stay close to the original image. The second indicates for how many steps we diffuse the latent vector: the higher this value, the closer the output is to the Monet distribution, but if the strength is too high, the outcome will stray far from the original image.
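As a toy illustration of how `--image_cond_scale` is assumed to act (this mirrors IP-Adapter's decoupled cross-attention, but is not the project's actual code): the image-conditioned attention output is scaled and added to the text-conditioned one, so values near 1.0 pull the result toward the original image.

```python
import numpy as np

# Toy sketch: blend text and image cross-attention outputs,
# with image_cond_scale weighting the image branch.
def combine(text_attn, image_attn, image_cond_scale):
    return text_attn + image_cond_scale * image_attn

text_attn = np.array([0.2, -0.1])
image_attn = np.array([1.0, 1.0])
print(combine(text_attn, image_attn, 0.0))  # text conditioning only
print(combine(text_attn, image_attn, 1.0))  # full image conditioning
```

At scale 0.0 the image condition is ignored entirely, which is why low values give the model more freedom to restyle the content.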
This work is one of the experiments in the final project of the Big Data Intelligence (Fall 2024) course at Tsinghua University 🟣. We would like to express our sincere gratitude to this course!