Please describe the bug
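Training with ShardParallel crashes: a MeshHostWorker process aborts with SIGABRT after the XLA check "Check failed: !info.content.empty()" in gpu_executable.cc.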
Please describe the expected behavior
Training should run without errors; instead it stops with an unexpected system error.
System information and environment
To Reproduce
Steps to reproduce the behavior:
1. Run the example to start training.
2. See the error below.
Screenshots
(MeshHostWorker pid=595449) 2023-07-08 01:00:54.519514: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:459] Check failed: !info.content.empty()
(MeshHostWorker pid=595449) *** SIGABRT received at time=1688778054 on cpu 149 ***
(MeshHostWorker pid=595449) PC: @ 0x7f41ad5cd03b (unknown) raise
(MeshHostWorker pid=595449) @ 0x7f41ad5cd0c0 4016 (unknown)
(MeshHostWorker pid=595449) @ 0x7f10a9fae28e 752 xla::gpu::GpuExecutable::ResolveConstantGlobals()
(MeshHostWorker pid=595449) @ 0x7f10ab561864 2784 xla::gpu::GpuExecutable::ExecuteAsyncOnStreamImpl()
(MeshHostWorker pid=595449) @ 0x7f10ab5631bf 128 xla::gpu::GpuExecutable::ExecuteAsyncOnStream()
(MeshHostWorker pid=595449) @ 0x7f10adf836e6 1376 xla::Executable::ExecuteAsyncOnStreamWrapper()
(MeshHostWorker pid=595449) @ 0x7f10ab9ff720 2432 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=595449) @ 0x7f10ab9ffe90 256 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=595449) @ 0x7f10ab5eb1fa 2720 xla::PjRtStreamExecutorExecutable::EnqueueExecution()
(MeshHostWorker pid=595449) @ 0x7f10ab5ec631 5360 xla::PjRtStreamExecutorExecutable::ExecuteHelper()
(MeshHostWorker pid=595449) @ 0x7f10ab5eea59 240 std::_Function_handler<>::_M_invoke()
(MeshHostWorker pid=595449) @ 0x7f10ab9d8378 208 xla::WorkerThread::WorkLoop()
(MeshHostWorker pid=595449) @ 0x7f10af0de3e5 80 tsl::(anonymous namespace)::PThread::ThreadFn()
(MeshHostWorker pid=595449) @ 0x7f41ad56f609 (unknown) start_thread
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: *** SIGABRT received at time=1688778054 on cpu 149 ***
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: PC: @ 0x7f41ad5cd03b (unknown) raise
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f41ad5cd0c0 4016 (unknown)
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10a9fae28e 752 xla::gpu::GpuExecutable::ResolveConstantGlobals()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab561864 2784 xla::gpu::GpuExecutable::ExecuteAsyncOnStreamImpl()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab5631bf 128 xla::gpu::GpuExecutable::ExecuteAsyncOnStream()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10adf836e6 1376 xla::Executable::ExecuteAsyncOnStreamWrapper()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab9ff720 2432 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab9ffe90 256 xla::LocalExecutable::RunAsync()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab5eb1fa 2720 xla::PjRtStreamExecutorExecutable::EnqueueExecution()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab5ec631 5360 xla::PjRtStreamExecutorExecutable::ExecuteHelper()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab5eea59 240 std::_Function_handler<>::_M_invoke()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10ab9d8378 208 xla::WorkerThread::WorkLoop()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f10af0de3e5 80 tsl::(anonymous namespace)::PThread::ThreadFn()
(MeshHostWorker pid=595449) [2023-07-08 01:00:54,596 E 595449 596143] logging.cc:361: @ 0x7f41ad56f609 (unknown) start_thread
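The same stack trace is printed twice, and the second copy carries a "[... logging.cc:361:" prefix on every line. Here is a minimal sketch for collapsing the log into one ordered list of frames. It assumes the line format shown above. FRAME_RE and unique_frames are illustrative names and are not part of Alpa, Ray, or XLA.

```python
import re
from typing import List, Tuple

# Matches one stack-frame line from the log, with or without the
# "[... logging.cc:361:" prefix that the second copy of the trace carries.
# Groups: program counter, frame size (or "(unknown)"), symbol.
FRAME_RE = re.compile(r"@\s+(0x[0-9a-f]+)\s+(\d+|\(unknown\))\s+(.+?)\s*$")


def unique_frames(log_text: str) -> List[Tuple[str, str]]:
    """Return (address, symbol) pairs in order of first appearance.

    Frames are deduplicated by address rather than by symbol, so the two
    distinct xla::LocalExecutable::RunAsync() frames are both kept.
    """
    seen = set()
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match is None:
            continue  # not a frame line (e.g. the "Check failed" or SIGABRT banner)
        address, _size, symbol = match.groups()
        if address not in seen:
            seen.add(address)
            frames.append((address, symbol))
    return frames
```

Applied to the log above, the logging.cc copy adds no new addresses. The first XLA frame below the abort is xla::gpu::GpuExecutable::ResolveConstantGlobals(), which is reached from the PjRt execution path on an XLA worker thread.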
Code snippet to reproduce the problem
Additional information