List of images behave differently on IterableDataset and Dataset #7461
Comments
Hi! Can you try with

In [20]: def train_iterable_gen():
    ...:     images = np.array(load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg").resize((128, 128)))
    ...:     yield {
    ...:         "images": np.expand_dims(images, axis=0),
    ...:         "messages": [
    ...:             {
    ...:                 "role": "user",
    ...:                 "content": [{"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" }]
    ...:             },
    ...:             {
    ...:                 "role": "assistant",
    ...:                 "content": [{"type": "text", "text": "duck" }]
    ...:             }
    ...:         ]
    ...:     }
    ...:
    ...: train_ds = IterableDataset.from_generator(train_iterable_gen,
    ...:     features=Features({
    ...:         'images': [datasets.Image(mode=None, decode=True, id=None)],
    ...:         'messages': [{'content': [{'text': datasets.Value(dtype='string', id=None), 'type': datasets.Value(dtype='string', id=None) }],
    ...:                       'role': datasets.Value(dtype='string', id=None)}]
    ...:     } )
    ...: )

In [21]: next(iter(train_ds))
/Users/quentinlhoest/hf/datasets/src/datasets/features/image.py:338: UserWarning: Downcasting array dtype int64 to uint8 to be compatible with 'Pillow'
  warnings.warn(f"Downcasting array dtype {dtype} to {dest_dtype} to be compatible with 'Pillow'")
Out[21]:
{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=128x128>],
 'messages': [{'content': [{'text': None, 'type': 'image'}], 'role': 'user'},
  {'content': [{'type': 'text', 'text': 'duck'}], 'role': 'assistant'}]}
Hm, I tried it here and it works as expected, even on datasets 3.3.2. I guess something in the SFTTrainer is doing additional processing on the dataset; I'll have a look there. Thanks @lhoestq!
Describe the bug
This code:
works as I'd expect; if I iterate the dataset then the images column returns a List[PIL.Image.Image], i.e. 'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=128x128 at 0x77EFB7EF4680>].
But if I change Dataset to IterableDataset, the images column changes into 'images': [{'path': None, 'bytes': ..]
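For anyone debugging the same symptom, a small check like the one below can help tell which of the two forms a row is in. This is only a sketch and not part of the original report: the helper names (is_encoded_image, describe_images_column, decode_image) are hypothetical, and it assumes the undecoded form is a dict with exactly the 'path' and 'bytes' keys, as in the output quoted above. The decode step assumes Pillow is installed.

```python
import io
from typing import Any, List


def is_encoded_image(item: Any) -> bool:
    """Return True if `item` looks like an undecoded image entry.

    Assumption: the undecoded form is a dict with exactly the keys
    'path' and 'bytes', matching the output shown in this issue.
    """
    return isinstance(item, dict) and set(item.keys()) == {"path", "bytes"}


def describe_images_column(row: dict) -> List[str]:
    """Label each entry of row['images'] as 'encoded' or 'decoded'."""
    return ["encoded" if is_encoded_image(x) else "decoded" for x in row["images"]]


def decode_image(item: Any) -> Any:
    """Decode an encoded entry with Pillow; pass anything else through unchanged."""
    if not is_encoded_image(item):
        return item
    from PIL import Image  # imported lazily so the checks above work without Pillow

    if item["bytes"] is not None:
        return Image.open(io.BytesIO(item["bytes"]))
    return Image.open(item["path"])
```

Calling describe_images_column(next(iter(train_ds))) right after building the dataset, and again on a row taken from inside the training pipeline, should show at which stage the images stop being decoded.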
Steps to reproduce the bug
The code above +
I'm feeding it to SFTTrainer
Expected behavior
Dataset and IterableDataset would behave the same
Environment info