List of images behave differently on IterableDataset and Dataset #7461

Closed
FredrikNoren opened this issue Mar 17, 2025 · 2 comments

FredrikNoren commented Mar 17, 2025

Describe the bug

This code:

import datasets
import numpy as np
from datasets import Dataset, Features

def train_iterable_gen():
    # load_image is the helper defined under "Steps to reproduce the bug"
    images = np.array(load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg").resize((128, 128)))
    yield {
        # Add a leading axis so the column holds a list containing one image
        "images": np.expand_dims(images, axis=0),
        "messages": [
            {
                "role": "user",
                "content": [{"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"}]
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": "duck"}]
            }
        ]
    }

train_ds = Dataset.from_generator(
    train_iterable_gen,
    features=Features({
        'images': [datasets.Image(mode=None, decode=True, id=None)],
        'messages': [{'content': [{'text': datasets.Value(dtype='string', id=None), 'type': datasets.Value(dtype='string', id=None)}], 'role': datasets.Value(dtype='string', id=None)}]
    })
)

works as I'd expect; if I iterate the dataset then the images column returns a List[PIL.Image.Image], i.e. 'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=128x128 at 0x77EFB7EF4680>].

But if I change Dataset to IterableDataset, the images column changes into 'images': [{'path': None, 'bytes': ..]

Steps to reproduce the bug

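The code above, plus this helper:

import io
import requests
from PIL import Image

def load_image(url):
    # Download the file and open it as a PIL image
    response = requests.get(url)
    image = Image.open(io.BytesIO(response.content))
    return image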

I'm feeding it to SFTTrainer

Expected behavior

Dataset and IterableDataset should behave the same.

Environment info

requires-python = ">=3.12"
dependencies = [
    "av>=14.1.0",
    "boto3>=1.36.7",
    "datasets>=3.3.2",
    "docker>=7.1.0",
    "google-cloud-storage>=2.19.0",
    "grpcio>=1.70.0",
    "grpcio-tools>=1.70.0",
    "moviepy>=2.1.2",
    "open-clip-torch>=2.31.0",
    "opencv-python>=4.11.0.86; sys_platform == 'darwin'",
    "opencv-python-headless>=4.11.0.86; sys_platform == 'linux'",
    "pandas>=2.2.3",
    "pillow>=10.4.0",
    "plotly>=6.0.0",
    "py-spy>=0.4.0",
    "pydantic>=2.10.6",
    "pydantic-settings>=2.7.1",
    "pymysql>=1.1.1",
    "ray[data,default,serve,train,tune]>=2.43.0",
    "torch>=2.6.0",
    "torchmetrics>=1.6.1",
    "torchvision>=0.21.0",
    "transformers[torch]@git+https://github.com/huggingface/transformers",
    "wandb>=0.19.4",
    # https://github.com/Dao-AILab/flash-attention/issues/833
    "flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl; sys_platform == 'linux'",
    "trl@https://github.com/huggingface/trl.git",
    "peft>=0.14.0",
]

lhoestq (Member) commented Mar 17, 2025

Hi! Can you try with datasets ^3.4, which was released recently? On my side it works with IterableDataset on the recent version :)

In [20]: def train_iterable_gen():
    ...:         images = np.array(load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg").resize((128, 128)))
    ...:         yield {
    ...:             "images": np.expand_dims(images, axis=0),
    ...:             "messages": [
    ...:                 {
    ...:                     "role": "user",
    ...:                     "content": [{"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" }]
    ...:                 },
    ...:                 {
    ...:                     "role": "assistant",
    ...:                     "content": [{"type": "text", "text": "duck" }]
    ...:                 }
    ...:             ]
    ...:         }
    ...: 
    ...: train_ds = IterableDataset.from_generator(train_iterable_gen,
    ...:                                     features=Features({
    ...:                                         'images': [datasets.Image(mode=None, decode=True, id=None)],
    ...:                                         'messages': [{'content': [{'text': datasets.Value(dtype='string', id=None), 'type': datasets.Value(dtype='string', id=None) }], 'role': datasets.Value(dtype='string', id=None)}]
    ...:                                         } )
    ...:                                         )

In [21]: next(iter(train_ds))
/Users/quentinlhoest/hf/datasets/src/datasets/features/image.py:338: UserWarning: Downcasting array dtype int64 to uint8 to be compatible with 'Pillow'
  warnings.warn(f"Downcasting array dtype {dtype} to {dest_dtype} to be compatible with 'Pillow'")
Out[21]: 
{'images': [<PIL.PngImagePlugin.PngImageFile image mode=RGB size=128x128>],
 'messages': [{'content': [{'text': None, 'type': 'image'}], 'role': 'user'},
  {'content': [{'type': 'text', 'text': 'duck'}], 'role': 'assistant'}]}

FredrikNoren (Author) commented

Hm, I tried it here and it works as expected, even on datasets 3.3.2. I guess something in SFTTrainer is doing additional processing on the dataset; I'll have a look there.
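While tracking down where the decoding gets lost, a small helper that accepts either form can be useful as a stopgap. This is only a sketch: `ensure_decoded` is a hypothetical name, and it assumes the encoded form is a dict with 'path' and 'bytes' keys, as in the report above. It uses Pillow only and does not touch SFTTrainer.

```python
import io

import PIL.Image


def ensure_decoded(item):
    """Return a PIL image for a decoded image or an encoded {'path', 'bytes'} dict."""
    if isinstance(item, PIL.Image.Image):
        # Already decoded, nothing to do
        return item
    if isinstance(item, dict):
        if item.get("bytes") is not None:
            # Decode from the in-memory bytes
            return PIL.Image.open(io.BytesIO(item["bytes"]))
        if item.get("path") is not None:
            # Fall back to opening the file on disk
            return PIL.Image.open(item["path"])
    raise TypeError(f"Cannot decode image item of type {type(item).__name__}")


# Build an encoded item shaped like the one seen in the report
buffer = io.BytesIO()
PIL.Image.new("RGB", (128, 128)).save(buffer, format="PNG")
encoded = {"path": None, "bytes": buffer.getvalue()}

decoded = ensure_decoded(encoded)
print(type(decoded).__name__, decoded.size)
```

Applying it as `[ensure_decoded(i) for i in example["images"]]` in a collator or preprocessing step would normalize both cases, though finding the step that drops the decoding is the better long-term fix.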

Thanks @lhoestq!
