In this example, we'll be training a Kwai Kolors model using the SimpleTuner toolkit with the `lora` model type.
Kolors is roughly the same size as SDXL, so you can try full training, but the changes for that are not described in this quickstart guide.
Make sure that you have Python installed; SimpleTuner works well with 3.10 through 3.12.
You can check this by running:
```bash
python --version
```

If you don't have Python 3.12 installed on Ubuntu, you can try the following:
```bash
apt -y install python3.12 python3.12-venv
```

For Vast, RunPod, and TensorDock (among others), the following will work on a CUDA 12.2-12.8 image to enable compiling of CUDA extensions:
```bash
apt -y install nvidia-cuda-toolkit
```

Install SimpleTuner via pip:
```bash
pip install simpletuner[cuda]
```

For manual installation or development setup, see the installation documentation.
To run SimpleTuner, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.
An experimental script, `configure.py`, may allow you to entirely skip this section through an interactive step-by-step configuration. It contains some safety features that help avoid common pitfalls.
Note: This doesn't configure your dataloader. You will still have to do that manually, later.
To run it:
```bash
simpletuner configure
```
⚠️ For users located in countries where Hugging Face Hub is not readily accessible, you should add `HF_ENDPOINT=https://hf-mirror.com` to your `~/.bashrc` or `~/.zshrc`, depending on which `$SHELL` your system uses.
If you prefer to manually configure:
Copy config/config.json.example to config/config.json:
```bash
cp config/config.json.example config/config.json
```

The following must be executed for an AMD MI300X to be usable:

```bash
apt install amd-smi-lib
pushd /opt/rocm/share/amd_smi
python3 -m pip install --upgrade pip
python3 -m pip install .
popd
```

In `config/config.json`, you will need to modify the following variables:
```json
{
  "model_type": "lora",
  "model_family": "kolors",
  "pretrained_model_name_or_path": "Kwai-Kolors/Kolors-diffusers",
  "output_dir": "/home/user/output/models",
  "validation_resolution": "1024x1024,1280x768",
  "validation_guidance": 3.4,
  "use_gradient_checkpointing": true,
  "learning_rate": 1e-4
}
```

- `pretrained_model_name_or_path` - Set this to `Kwai-Kolors/Kolors-diffusers`.
- `model_type` - Set this to `lora`.
- `use_dora` - Set this to `true` if you wish to train DoRA.
- `model_family` - Set this to `kolors`.
- `output_dir` - Set this to the directory where you want to store your checkpoints and validation images. It's recommended to use a full path here.
- `validation_resolution` - Set this to `1024x1024` for this example.
  - Additionally, Kolors was fine-tuned on multi-aspect buckets, and other resolutions may be specified using commas to separate them: `1024x1024,1280x768`
- `validation_guidance` - Use whatever value you are comfortable with for testing at inference time. Set this between `4.2` and `6.4`.
- `use_gradient_checkpointing` - This should probably be `true` unless you have a LOT of VRAM and want to sacrifice some to make it go faster.
- `learning_rate` - `1e-4` is fairly common for low-rank networks, though `1e-5` might be a more conservative choice if you notice any "burning" or early overtraining.
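To illustrate how a comma-separated `validation_resolution` value breaks down into multiple aspect buckets, here is a small sketch (the helper function is ours for illustration, not part of SimpleTuner):

```python
# Hypothetical helper: split a validation_resolution string like
# "1024x1024,1280x768" into (width, height) pairs, one per bucket.
def parse_resolutions(spec: str) -> list[tuple[int, int]]:
    buckets = []
    for item in spec.split(","):
        width, height = item.strip().split("x")
        buckets.append((int(width), int(height)))
    return buckets

print(parse_resolutions("1024x1024,1280x768"))  # [(1024, 1024), (1280, 768)]
```

Each pair becomes a separate validation render, so adding resolutions increases validation time proportionally.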
There are a few more if using a Mac M-series machine:
- `mixed_precision` should be set to `no`.
- `attention_mechanism` should be set to `diffusers`, since `xformers` and other values probably will not work.
Tested on Apple and NVIDIA systems, Hugging Face Optimum-Quanto can be used to reduce the precision and VRAM requirements, most notably of ChatGLM 6B (the text encoder).
For config.json:
```json
{
  "base_model_precision": "int8-quanto",
  "text_encoder_1_precision": "no_change",
  "optimizer": "adamw_bf16"
}
```

For `config.env` users (deprecated):
```bash
# choices: int8-quanto, int4-quanto, int2-quanto, fp8-quanto
# int8-quanto was tested with a single-subject dreambooth LoRA.
# fp8-quanto does not work on Apple systems. you must use int levels.
# int2-quanto is pretty extreme and gets the whole rank-1 LoRA down to about 13.9GB VRAM.
# may the gods have mercy on your soul, should you push things Too Far.
export TRAINER_EXTRA_ARGS="--base_model_precision=int8-quanto"

# Maybe you want the text encoders to remain full precision so your text embeds are cake.
# We unload the text encoders before training, so that's not an issue during training time - only during pre-caching.
# Alternatively, you can go ham on quantisation here and run them in int4 or int8 mode, because no one can stop you.
export TRAINER_EXTRA_ARGS="${TRAINER_EXTRA_ARGS} --text_encoder_1_precision=no_change"

# When you're quantising the model, --base_model_default_dtype is set to bf16 by default. This setup requires adamw_bf16, but saves the most memory.
# adamw_bf16 only supports bf16 training, but any other optimiser will support either bf16 or fp32 training precision.
export OPTIMIZER="adamw_bf16"
```

It's crucial to have a substantial dataset to train your model on. Note that the bare minimum dataset size is `train_batch_size * gradient_accumulation_steps`: a dataset smaller than this will not be discoverable by the trainer, so make sure yours is large enough to train your model effectively.
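To make the minimum-size rule concrete, here is a trivial worked calculation (the function name is ours, not a SimpleTuner API):

```python
# The trainer skips datasets smaller than the effective batch size,
# i.e. the number of samples consumed per optimiser step.
def minimum_dataset_size(train_batch_size: int, gradient_accumulation_steps: int) -> int:
    return train_batch_size * gradient_accumulation_steps

# e.g. a batch size of 4 with 2 accumulation steps needs at least 8 images
print(minimum_dataset_size(4, 2))  # 8
```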
Depending on the dataset you have, you will need to set up your dataset directory and dataloader configuration file differently. In this example, we will be using pseudo-camera-10k as the dataset.
In your `output_dir` directory, create a `multidatabackend.json`:
```json
[
  {
    "id": "pseudo-camera-10k-kolors",
    "type": "local",
    "crop": true,
    "crop_aspect": "square",
    "crop_style": "random",
    "resolution": 1.0,
    "minimum_image_size": 0.25,
    "maximum_image_size": 1.0,
    "target_downsample_size": 1.0,
    "resolution_type": "area",
    "cache_dir_vae": "cache/vae/kolors/pseudo-camera-10k",
    "instance_data_dir": "/home/user/simpletuner/datasets/pseudo-camera-10k",
    "disabled": false,
    "skip_file_discovery": "",
    "caption_strategy": "filename",
    "metadata_backend": "discovery"
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/kolors/pseudo-camera-10k",
    "disabled": false,
    "write_batch_size": 128
  }
]
```

Then, create a `datasets` directory:
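Before launching, a quick parse of the dataloader config can catch typos early. The check below is an ad-hoc sketch, not an official SimpleTuner validator; the required keys are assumed from the example above, and the JSON is inlined here so the snippet is self-contained:

```python
import json

# Trimmed-down copy of the multidatabackend.json above; in practice you
# would read the real file with open("multidatabackend.json").
config_text = """[
  {"id": "pseudo-camera-10k-kolors", "type": "local",
   "instance_data_dir": "/home/user/simpletuner/datasets/pseudo-camera-10k"},
  {"id": "text-embeds", "type": "local", "dataset_type": "text_embeds",
   "cache_dir": "cache/text/kolors/pseudo-camera-10k"}
]"""

backends = json.loads(config_text)
for backend in backends:
    # Every image dataset (anything that isn't a text-embed cache)
    # needs an instance_data_dir pointing at the images.
    if backend.get("dataset_type") != "text_embeds":
        assert "instance_data_dir" in backend, f"missing path in {backend['id']}"
print("ok")
```

A malformed file fails at `json.loads` with a line number, which is usually faster to debug than a trainer stack trace.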
```bash
mkdir -p datasets
huggingface-cli download --repo-type=dataset bghira/pseudo-camera-10k --local-dir=datasets/pseudo-camera-10k
```

This will download about 10k photograph samples to your `datasets/pseudo-camera-10k` directory, which will be created for you automatically.
You'll want to log in to WandB and HF Hub before beginning training, especially if you're using `push_to_hub: true` and `--report_to=wandb`.

If you're going to be pushing items to a Git LFS repository manually, you should also run `git config --global credential.helper store`.
Run the following commands:
```bash
wandb login
```

and

```bash
huggingface-cli login
```

Follow the instructions to log in to both services.
From the SimpleTuner directory, simply run:

```bash
bash train.sh
```

This will begin the text embed and VAE output caching to disk.
For more information, see the dataloader and tutorial documents.
If you wish to enable evaluations to score the model's performance, see this document for information on configuring and interpreting CLIP scores.
If you wish to use stable MSE loss to score the model's performance, see this document for information on configuring and interpreting evaluation loss.