Support Cloud Solution Provider #97

Closed
madsbk opened this issue Aug 1, 2022 · 6 comments

Comments

@madsbk
Member

madsbk commented Aug 1, 2022

cufile.so might crash when used within a VM in the cloud.
KvikIO should detect this and fall back to its own implementation.
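
One way to "detect this" without taking the whole process down is to try the risky initialization in a child process first. The sketch below is only an illustration of that idea, not KvikIO's actual mechanism: `probe_survives` and `choose_backend` are hypothetical names, and `snippet` stands in for whatever code would open the cuFile driver.

```python
import subprocess
import sys


def probe_survives(snippet: str, timeout: float = 30.0) -> bool:
    """Run `snippet` in a fresh interpreter; True only if it exits cleanly."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    # On POSIX a negative return code means the child was killed by a signal
    # (for example SIGABRT from a failed assertion). Any nonzero code is
    # treated as "do not use this backend".
    return result.returncode == 0


def choose_backend(snippet: str) -> str:
    """Return "cufile" if the probe survives, otherwise "posix" (hypothetical labels)."""
    return "cufile" if probe_survives(snippet) else "posix"
```

Because the probe runs in a separate process, an assertion failure or abort inside the library only kills the child, and the parent can still pick the POSIX path.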

@gigony

@jacobtomlinson
Member

Out of curiosity, why would it crash on a cloud VM?

@madsbk
Member Author

madsbk commented Aug 2, 2022

I don't know. @gigony, do you know?

@gigony

gigony commented Aug 2, 2022

It seems there is logic in the cuFileDriverOpen() method that assumes specific device mounts and crashes when that assumption fails.
The same happens on WSL2.
I shared this with the GDS team, and it is a bug. I filed a bug report to address the issue.
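
Until the driver is fixed, a caller could check the environment before touching cuFile at all. The following is a rough sketch under stated assumptions: it guesses that the failing assumption concerns udev data under `/run/udev`, and it treats a kernel release string containing "microsoft" as a sign of WSL. The function names are hypothetical and are not part of cuFile or KvikIO.

```python
import os
import platform
from typing import Optional


def is_wsl(release: Optional[str] = None) -> bool:
    """Heuristic: WSL kernels carry "microsoft" in their release string."""
    if release is None:
        release = platform.release()
    return "microsoft" in release.lower()


def udev_dir_readable(path: str = "/run/udev") -> bool:
    """True if the udev runtime directory exists and is readable (assumed requirement)."""
    return os.path.isdir(path) and os.access(path, os.R_OK)


def should_avoid_cufile(
    release: Optional[str] = None, udev_path: str = "/run/udev"
) -> bool:
    """Decide up front whether to skip the cuFile driver and use plain POSIX I/O."""
    return is_wsl(release) or not udev_dir_readable(udev_path)
```

A check like this is cheap and runs before any driver call, but it is only as good as the guess about what the driver expects, so it may need adjusting once the root cause is known.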

@madsbk
Member Author

madsbk commented Sep 5, 2022

xref: rapidsai/cucim#346

@UTKRISHTPATESARIA

UTKRISHTPATESARIA commented Apr 7, 2023

Hello @madsbk @gigony,
Has this issue been resolved?
I'm using CUDA 11.7 and still hit the error when installing GDS on a VM:

============
 ENVIRONMENT:
 ============
 =====================
 DRIVER CONFIGURATION:
 =====================
 NVMe               : Unsupported
 NVMeOF             : Unsupported
 SCSI               : Unsupported
 ScaleFlux CSD      : Unsupported
 NVMesh             : Unsupported
 DDN EXAScaler      : Unsupported
 IBM Spectrum Scale : Unsupported
 NFS                : Unsupported
 BeeGFS             : Unsupported
 WekaFS             : Unsupported
 Userspace RDMA     : Unsupported
 --Mellanox PeerDirect : Disabled
 --rdma library        : Not Loaded (libcufile_rdma.so)
 --rdma devices        : Not configured
 --rdma_device_status  : Up: 0 Down: 0
 =====================
 CUFILE CONFIGURATION:
 =====================
 properties.use_compat_mode : true
 properties.force_compat_mode : false
 properties.gds_rdma_write_support : true
 properties.use_poll_mode : false
 properties.poll_mode_max_size_kb : 4
 properties.max_batch_io_size : 128
 properties.max_batch_io_timeout_msecs : 5
 properties.max_direct_io_size_kb : 16384
 properties.max_device_cache_size_kb : 131072
 properties.max_device_pinned_mem_size_kb : 33554432
 properties.posix_pool_slab_size_kb : 4 1024 16384 
 properties.posix_pool_slab_count : 128 64 32 
 properties.rdma_peer_affinity_policy : RoundRobin
 properties.rdma_dynamic_routing : 0
 fs.generic.posix_unaligned_writes : false
 fs.lustre.posix_gds_min_kb: 0
 fs.beegfs.posix_gds_min_kb: 0
 fs.weka.rdma_write_support: false
 profile.nvtx : false
 profile.cufile_stats : 0
 miscellaneous.api_check_aggressive : false
 =========
 GPU INFO:
 =========
 GPU index 0 Tesla V100-PCIE-16GB bar:1 bar size (MiB):16384 supports GDS
 ==============
 PLATFORM INFO:
 ==============
Assertion failure, file index :cufio-udev  line :134

vuule pushed a commit to vuule/kvikio that referenced this issue Nov 8, 2023
Raise NotImplementedError for unsupported datelike resolutions
@madsbk
Member Author

madsbk commented Jun 26, 2024

AFAICT, KvikIO should detect this now.

@madsbk madsbk closed this as completed Jun 26, 2024