
Permissions error on /tmp/ir_dataset directory due to multiple users on the same server #206

Open
mitgosp opened this issue Aug 28, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@mitgosp

mitgosp commented Aug 28, 2022

Describe the bug
When more than one user on the same server or device uses ir_datasets to fetch documents, a permission-denied error can occur if one of the users does not have write access to the already created temporary directory.

Affected dataset(s)
This issue is not specific to any dataset.

To Reproduce
Steps to reproduce the behavior:

  1. User A runs a script that fetches some documents using ir_datasets.
  2. User B, who is on the same system, performs the same actions.
  3. User B falls under the "others" permission class on the system and hence does not have write permission to the already existing /tmp/ir_datasets directory.
  4. User B sees the following error (a minimal sketch reproducing it is given below the list):
    PermissionError: [Errno 13] Permission denied: '/tmp/ir_datasets/tmp3sn3tbic'
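
A minimal sketch that reproduces the error outside of ir_datasets, assuming the library creates its temporary files with tempfile inside a shared /tmp/ir_datasets (as the path in the error suggests); run it as user B after user A has created the directory:

import tempfile

# /tmp/ir_datasets already exists, owned by user A, and is not writable by user B,
# so creating a temporary file inside it fails with PermissionError: [Errno 13].
with tempfile.NamedTemporaryFile(dir='/tmp/ir_datasets') as f:
    pass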

Expected behavior
When multiple users are using the package on the same device, additional checks would need to be in place to avoid permission errors. For example, the directory that is created for tmp files (/tmp/ir_datasets) could include the username to avoid such conflicts.
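
For illustration, a minimal sketch of such a per-user temporary directory (the helper name is hypothetical, not part of the library's API):

import getpass
import os
import tempfile

def per_user_tmp_dir():
    # e.g. /tmp/ir_datasets-alice instead of a shared /tmp/ir_datasets
    path = os.path.join(tempfile.gettempdir(), f'ir_datasets-{getpass.getuser()}')
    os.makedirs(path, exist_ok=True)
    return path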

Additional context
This issue can be worked around by pointing the IR_DATASETS_TMP environment variable at a user-specific directory.
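
For example (assuming the variable is read when ir_datasets is imported, so it must be set beforehand or exported in the shell before launching Python):

import os
os.environ['IR_DATASETS_TMP'] = os.path.expanduser('~/ir_datasets_tmp')  # any user-writable directory

import ir_datasets  # temporary files now go to the per-user directory above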

mitgosp added the bug label on Aug 28, 2022
mitgosp changed the title to "Permissions error on /tmp/ir_dataset directory due to multiple users on the same server" on Aug 28, 2022
@yuenherny

I ran into a similar issue, but I'm not quite sure if it is the same bug as yours, @mitgosp.

Tried running:

import ir_datasets

# Load the MS MARCO passage dev/small subset and iterate over its scored documents.
train = ir_datasets.load('msmarco-passage/dev/small')
for scoreddoc in train.scoreddocs_iter():
    pass  # iterating only to trigger the download

After the download finished (45 mins), I got this error:

[WARNING] Download failed: [WinError 5] Access is denied: 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmpewbixzbf.tmp' -> 'C:\\Users\\USER\\AppData\\Local\\Temp\\ir_datasets\\tmpewbixzbf'
---------------------------------------------------------------------------
PermissionError                           Traceback (most recent call last)
File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\fileio.py:69, in Cache.verify(self)
     68 with self._streamer.stream() as stream:
---> 69     shutil.copyfileobj(stream, f)
     70 f.close() # close file before move... Needed because of Windows

File ~\AppData\Local\Programs\Python\Python310\lib\shutil.py:195, in copyfileobj(fsrc, fdst, length)
    194 while True:
--> 195     buf = fsrc_read(length)
    196     if not buf:

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\util\fileio.py:35, in IterStream.readinto(self, b)
     34 l = len(b) - pos  # We're supposed to return at most this much
---> 35 chunk = self.leftover or next(self.it)
     36 output, self.leftover = chunk[:l], chunk[l:]

File d:\Repos\XpressAI\vecto-reranking\venv\lib\site-packages\ir_datasets\datasets\msmarco_passage.py:52, in ExtractQidPid.__iter__(self)
     51 def __iter__(self):
---> 52     with self._streamer.stream() as stream:
     53         for line in _logger.pbar(stream, desc='extracting QID/PID pairs', unit='pair'):

File ~\AppData\Local\Programs\Python\Python310\lib\contextlib.py:135, in _GeneratorContextManager.__enter__(self)
    134 try:
--> 135     return next(self.gen)
...
-> 1206     self._accessor.unlink(self)
   1207 except FileNotFoundError:
   1208     if not missing_ok:

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\USER\\.ir_datasets\\msmarco-passage\\dev\\ms.run.tmp2'

P.S. Full cell output:

[INFO] Please confirm you agree to the MSMARCO data usage agreement found at <http://www.msmarco.org/dataset.aspx>
[INFO] If you have a local copy of https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz, you can symlink it here to avoid downloading it again: C:\Users\USER\.ir_datasets\downloads\8c140662bdf123a98fbfe3bb174c5831
[INFO] [starting] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz
[INFO] [finished] https://msmarco.blob.core.windows.net/msmarcoranking/top1000.dev.tar.gz: [45:28] [687MB] [252kB/s]
(followed by the same [WARNING] and PermissionError traceback shown above)

Screenshot: (attached image showing the same error output)

@yuenherny

@mitgosp Sorry if this is a stupid question, but how do I utilize the IR_DATASETS_TMP environment variable to bypass this issue?

@seanmacavaney
Collaborator

Hi @yuenherny -- it looks like this is a different issue.

Do you have multiple processes open that are using ir_datasets (e.g., multiple notebook instances)? While files are downloading, only a single process can access them on Windows.

@yuenherny

Hi @seanmacavaney, thanks for the prompt response.

Nope, I guess the process is open because I tried to download multiple parts of the dataset (queries, scoreddocs, docs, qrels) in sequence in my notebook, and when one hits an error, the process isn't closed automatically.

Now that I managed to download it (after restarting my laptop), I get another error:

[INFO] Please confirm you agree to the MSMARCO data usage agreement found at <http://www.msmarco.org/dataset.aspx>
[INFO] If you have a local copy of https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz, you can symlink it here to avoid downloading it again: C:\Users\USER\.ir_datasets\downloads\31644046b18952c1386cd4564ba2ae69
[INFO] [starting] https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz
[INFO] [finished] https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz: [59:56] [954MB] [265kB/s]
[WARNING] Download failed: Expected md5 hash to be 31644046b18952c1386cd4564ba2ae69 but got 9a1336b80866927a64cd43a5d820f277

Possibly due to an incomplete download?

@seanmacavaney
Collaborator

and when one hits an error, the process isn't closed automatically

Gotcha -- thanks! This is a bug, as it should close the file in this case so others can use it. I'll look into fixing this.
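
For reference, a minimal sketch of that kind of fix, assuming the copy happens roughly the way Cache.verify in fileio.py does in the traceback above (an illustration, not the actual patch):

import shutil

def copy_stream_to_file(streamer, path):
    f = open(path, 'wb')
    try:
        with streamer.stream() as stream:
            shutil.copyfileobj(stream, f)
    finally:
        f.close()  # close even on error so other processes can access the file on Windows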

Possibly due to incomplete download?

Yep, something went wrong with the download. It's not safe to use this version because the contents could be different, or you may be missing some records.
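
If you want to verify a local copy of the archive yourself (e.g., before symlinking it as the log suggests), a quick check with Python's hashlib; the path below is a placeholder:

import hashlib

def md5_of(path, chunk_size=1 << 20):
    # Hash the file in chunks so large downloads don't need to fit in memory.
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# Placeholder path; the expected hash is the one reported in the download log above.
print(md5_of(r'C:\path\to\collectionandqueries.tar.gz') == '31644046b18952c1386cd4564ba2ae69')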
