
Commit e757ab8

Adding steps on how to fine-tune on any custom dataset.
Signed-off-by: Swati Allabadi <[email protected]>
1 parent c889ad6 commit e757ab8

File tree: 2 files changed (+36 −2 lines)


QEfficient/finetune/dataset/custom_dataset.py (+2 −2)
```diff
@@ -23,7 +23,7 @@ def load_module_from_py_file(py_file: str) -> object:
     return module
 
 
-def get_custom_dataset(dataset_config, tokenizer, split: str):
+def get_custom_dataset(dataset_config, tokenizer, split: str, context_length=None):
     if ":" in dataset_config.file:
         module_path, func_name = dataset_config.file.split(":")
     else:
@@ -38,7 +38,7 @@ def get_custom_dataset(dataset_config, tokenizer, split: str):
 
     module = load_module_from_py_file(module_path.as_posix())
     try:
-        return getattr(module, func_name)(dataset_config, tokenizer, split)
+        return getattr(module, func_name)(dataset_config, tokenizer, split, context_length)
     except AttributeError as e:
         print(
             f"It seems like the given method name ({func_name}) is not present in the dataset .py file ({module_path.as_posix()})."
```

docs/source/finetune.md (+34)
The following section is appended at the end of the file, immediately after the existing TensorBoard snippet (`tensorboard --logdir runs/<file> --bind_all`):

## Fine-Tuning on custom dataset

To run fine-tuning on any user-specific dataset, prepare the dataset using the following steps:

1) Create a directory named 'dataset' inside efficient-transformers.
2) Inside this directory, create a file named 'custom_dataset.py'. This is different from the custom_dataset.py present at efficient-transformers/QEfficient/finetune/dataset.
3) Inside the newly created efficient-transformers/dataset/custom_dataset.py, define a function named 'get_custom_dataset'.
4) get_custom_dataset() should have the following 4 parameters: dataset_config, tokenizer, split, context_length. This function is called twice from QEfficient/cloud/finetune.py, under the name get_preprocessed_dataset.
5) Inside get_custom_dataset(), the dataset needs to be prepared for fine-tuning, so apply the prompt and tokenize the dataset accordingly. Please refer to the template below on how to define get_custom_dataset().
6) For examples, please refer to the Python files present in efficient-transformers/QEfficient/finetune/dataset. In the case of the Samsum dataset, get_preprocessed_samsum() of efficient-transformers/QEfficient/finetune/dataset/samsum_dataset.py is called.
7) In efficient-transformers/QEfficient/finetune/configs/dataset_config.py, for the custom_dataset class, pass the appropriate values for train_split and test_split according to the dataset keys corresponding to the train and test data points (a hypothetical sketch follows this list).
8) While running fine-tuning, pass the argument "--dataset custom_dataset" to fine-tune on the custom dataset (an example invocation follows below).

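For step 7, here is a minimal sketch of what the custom_dataset entry in dataset_config.py might look like. The dataclass layout and any fields other than train_split and test_split are assumptions for illustration, not taken from this commit; check the actual file in the repository:

```python
from dataclasses import dataclass


@dataclass
class custom_dataset:
    # Hypothetical layout; verify against QEfficient/finetune/configs/dataset_config.py.
    dataset: str = "custom_dataset"
    file: str = "dataset/custom_dataset.py:get_custom_dataset"
    train_split: str = "train"       # key of the training portion of your dataset
    test_split: str = "validation"   # key of the evaluation portion of your dataset
```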
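For step 8, the launch command might look like `python -m QEfficient.cloud.finetune --dataset custom_dataset`. The entry-point module path here is an assumption based on the finetune.py location mentioned in step 4; only the `--dataset custom_dataset` argument itself comes from this commit.
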
The template for get_custom_dataset(), to be defined inside efficient-transformers/dataset/custom_dataset.py, is as follows:

```python
def get_custom_dataset(dataset_config, tokenizer, split, context_length=None):
    # Load the dataset.
    # Based on `split`, retrieve only the specific portion of the dataset
    # (train or eval), either here or at the end.

    def apply_prompt_template(sample):
        # Define the prompt and apply it to one data point.
        ...

    def tokenize(sample):
        # Tokenize one prompted data point.
        ...

    # Call apply_prompt_template() for each data point:
    # dataset = dataset.map(apply_prompt_template, <other args>)
    # Call tokenize() for each data point:
    # dataset = dataset.map(tokenize, <other args>)

    return dataset
```
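
As a concrete illustration of the template, here is a minimal sketch built on the Hugging Face datasets library, loosely following the Samsum example referenced in step 6. The dataset name, column names, prompt text, and the way dataset_config and context_length are used are all assumptions for illustration, not something this commit prescribes:

```python
from datasets import load_dataset


def get_custom_dataset(dataset_config, tokenizer, split, context_length=None):
    # Placeholder corpus; swap in your own dataset and column names.
    dataset = load_dataset("samsum", split=split)

    def apply_prompt_template(sample):
        # Build an instruction-style prompt from the raw fields.
        return {
            "prompt": f"Summarize this dialog:\n{sample['dialogue']}\n---\nSummary:\n",
            "summary": sample["summary"],
        }

    def tokenize(sample):
        # Tokenize prompt + target, truncating/padding only when a
        # context length was requested.
        return tokenizer(
            sample["prompt"] + sample["summary"],
            max_length=context_length,
            truncation=context_length is not None,
            padding="max_length" if context_length is not None else False,
        )

    dataset = dataset.map(apply_prompt_template, remove_columns=list(dataset.features))
    dataset = dataset.map(tokenize, remove_columns=["prompt", "summary"])
    return dataset
```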

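Before kicking off a full run, the loader can be smoke-tested in isolation. This snippet is illustrative and assumes the directory layout from step 2 is importable from the current working directory:

```python
from transformers import AutoTokenizer

# Path assumes the layout from step 2 (efficient-transformers/dataset/custom_dataset.py).
from dataset.custom_dataset import get_custom_dataset

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any checkpoint works here
tokenizer.pad_token = tokenizer.eos_token          # gpt2 ships without a pad token

# dataset_config is unused in the sketch above, so None suffices for a quick check.
dataset = get_custom_dataset(None, tokenizer, split="train", context_length=512)
print(dataset[0].keys())  # expect input_ids / attention_mask
```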