DVC integration broken #7421

Open
maxstrobel opened this issue Feb 25, 2025 · 1 comment

Comments

@maxstrobel

Describe the bug

The DVC integration appears to be broken: loading a CSV file through a dvc:// URL fails with a TypeError.
I followed this guide: https://dvc.org/doc/user-guide/integrations/huggingface

Steps to reproduce the bug

Script to reproduce

from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files="dvc://workshop/satellite-data/jan_train.csv",
    storage_options={"url": "https://github.com/iterative/dataset-registry.git"},
)

print(dataset)

Error log

Traceback (most recent call last):
  File "C:\tmp\test\load.py", line 3, in <module>
    dataset = load_dataset(
              ^^^^^^^^^^^^^
  File "C:\tmp\test\.venv\Lib\site-packages\datasets\load.py", line 2151, in load_dataset
    builder_instance.download_and_prepare(
  File "C:\tmp\test\.venv\Lib\site-packages\datasets\builder.py", line 808, in download_and_prepare
    fs, output_dir = url_to_fs(output_dir, **(storage_options or {}))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: url_to_fs() got multiple values for argument 'url'

Expected behavior

The integration works: the indicated file is downloaded and opened.

Environment info

Python version

python --version
Python 3.11.10

Venv (pip install datasets dvc):

Package                Version
---------------------- -----------
aiohappyeyeballs       2.4.6
aiohttp                3.11.13
aiohttp-retry          2.9.1
aiosignal              1.3.2
amqp                   5.3.1
annotated-types        0.7.0
antlr4-python3-runtime 4.9.3
appdirs                1.4.4
asyncssh               2.20.0
atpublic               5.1
attrs                  25.1.0
billiard               4.2.1
celery                 5.4.0
certifi                2025.1.31
cffi                   1.17.1
charset-normalizer     3.4.1
click                  8.1.8
click-didyoumean       0.3.1
click-plugins          1.1.1
click-repl             0.3.0
colorama               0.4.6
configobj              5.0.9
cryptography           44.0.1
datasets               3.3.2
dictdiffer             0.9.0
dill                   0.3.8
diskcache              5.6.3
distro                 1.9.0
dpath                  2.2.0
dulwich                0.22.7
dvc                    3.59.1
dvc-data               3.16.9
dvc-http               2.32.0
dvc-objects            5.1.0
dvc-render             1.0.2
dvc-studio-client      0.21.0
dvc-task               0.40.2
entrypoints            0.4
filelock               3.17.0
flatten-dict           0.4.2
flufl-lock             8.1.0
frozenlist             1.5.0
fsspec                 2024.12.0
funcy                  2.0
gitdb                  4.0.12
gitpython              3.1.44
grandalf               0.8
gto                    1.7.2
huggingface-hub        0.29.1
hydra-core             1.3.2
idna                   3.10
iterative-telemetry    0.0.10
kombu                  5.4.2
markdown-it-py         3.0.0
mdurl                  0.1.2
multidict              6.1.0
multiprocess           0.70.16
networkx               3.4.2
numpy                  2.2.3
omegaconf              2.3.0
orjson                 3.10.15
packaging              24.2
pandas                 2.2.3
pathspec               0.12.1
platformdirs           4.3.6
prompt-toolkit         3.0.50
propcache              0.3.0
psutil                 7.0.0
pyarrow                19.0.1
pycparser              2.22
pydantic               2.10.6
pydantic-core          2.27.2
pydot                  3.0.4
pygit2                 1.17.0
pygments               2.19.1
pygtrie                2.5.0
pyparsing              3.2.1
python-dateutil        2.9.0.post0
pytz                   2025.1
pywin32                308
pyyaml                 6.0.2
requests               2.32.3
rich                   13.9.4
ruamel-yaml            0.18.10
ruamel-yaml-clib       0.2.12
scmrepo                3.3.10
semver                 3.0.4
setuptools             75.8.0
shellingham            1.5.4
shortuuid              1.0.13
shtab                  1.7.1
six                    1.17.0
smmap                  5.0.2
sqltrie                0.11.2
tabulate               0.9.0
tomlkit                0.13.2
tqdm                   4.67.1
typer                  0.15.1
typing-extensions      4.12.2
tzdata                 2025.1
urllib3                2.3.0
vine                   5.1.0
voluptuous             0.15.2
wcwidth                0.2.13
xxhash                 3.5.0
yarl                   1.18.3
zc-lockfile            3.0.post1
@lhoestq
Member

lhoestq commented Mar 3, 2025

Unfortunately, url is a reserved argument in fsspec.url_to_fs, so ideally filesystem implementations like DVC should use a different argument name to avoid this kind of error.
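
To illustrate the collision, here is a minimal sketch that does not import fsspec or datasets. The `url_to_fs` function below is a hypothetical stand-in that only mirrors the `(url, **kwargs)` call shape seen in the traceback, and the `output_dir` value is an assumed placeholder. Passing `storage_options` containing a `"url"` key alongside a positional first argument reproduces the same TypeError:

```python
def url_to_fs(url, **kwargs):
    """Hypothetical stand-in: first parameter is named `url`, the rest are keyword options."""
    return url, kwargs


# Same storage_options as in the reproduction script above.
storage_options = {"url": "https://github.com/iterative/dataset-registry.git"}

# Assumed placeholder for the cache directory that datasets passes positionally.
output_dir = "/tmp/hf-cache"

try:
    # Mirrors the failing call: url_to_fs(output_dir, **(storage_options or {}))
    url_to_fs(output_dir, **storage_options)
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Because the positional argument already binds to the `url` parameter, the `"url"` key unpacked from `storage_options` tries to bind to it a second time, which Python rejects before the function body runs.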
