[Bug] Random Address size fault crashes on iOS devices #121

Open
tqtifnypmb opened this issue Aug 7, 2024 · 5 comments

Comments

@tqtifnypmb

tqtifnypmb commented Aug 7, 2024

Describe the issue:

MLX occasionally crashes with an "Address size fault" while evaluating the graph on iOS devices. The issue rarely occurs on macOS devices.

Configuration:

Devices: iPhone/iPad
System Version: From iOS 16.7 to iOS 17.5
MLX-Swift version: 0.16.0

Crash Logs:

0   libsystem_kernel.dylib        	0x00000001ddd0742c __pthread_kill + 8 (:-1)
1   libsystem_pthread.dylib       	0x00000001f1aa6c0c pthread_kill + 268 (pthread.c:1721)
2   libsystem_c.dylib             	0x000000019cbaaba0 abort + 180 (abort.c:118)
3   libc++abi.dylib               	0x00000001f19c4ca4 abort_message + 132 (abort_message.cpp:78)
4   libc++abi.dylib               	0x00000001f19b4e40 demangling_terminate_handler() + 320 (cxa_default_handlers.cpp:72)
5   libobjc.A.dylib               	0x000000018cb21e3c _objc_terminate() + 160 (objc-exception.mm:499)
6   libc++abi.dylib               	0x00000001f19c4068 std::__terminate(void (*)()) + 16 (cxa_handlers.cpp:59)
7   libc++abi.dylib               	0x00000001f19c735c __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88 (cxa_exception.cpp:152)
8   libc++abi.dylib               	0x00000001f19c72a0 __cxa_throw + 308 (cxa_exception.cpp:283)
9   Test              	0x00000001047756f4 mlx::core::metal::Device::get_kernel_(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, MTL::Function const*) + 228 (device.cpp:396)
10  Test              	0x0000000104775ce8 mlx::core::metal::Device::get_kernel(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, MTL::Library*, std::__1::basic_string<char, std::__1::char_traits<c... + 188 (device.cpp:541)
11  Test              	0x0000000104789774 mlx::core::get_reduce_kernel(mlx::core::metal::Device&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_trai... + 824 (jit_kernels.cpp:353)
12  Test              	0x00000001047a72dc mlx::core::row_reduce_general_dispatch(mlx::core::array const&, mlx::core::array&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, mlx::core::ReductionP... + 1028 (reduce.cpp:178)
13  Test              	0x00000001047a96dc mlx::core::Reduce::eval_gpu(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>> const&, mlx::core::array&) + 1380 (reduce.cpp:627)
14  Test              	0x000000010479b4bc mlx::core::metal::make_task(mlx::core::array, bool)::$_1::operator()() + 256 (metal.cpp:66)
15  Test              	0x000000010479b4bc decltype(std::declval<mlx::core::metal::make_task(mlx::core::array, bool)::$_1&>()()) std::__1::__invoke[abi:ue170006]<mlx::core::metal::make_task(mlx::core::array, bool)::$_1&>(mlx::core::metal::m... + 256 (invoke.h:340)
16  Test              	0x000000010479b4bc void std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ue170006]<mlx::core::metal::make_task(mlx::core::array, bool)::$_1&>(mlx::core::metal::make_task(mlx::core::array, bool)::$_1&) + 256 (invoke.h:415)
17  Test              	0x000000010479b4bc std::__1::__function::__alloc_func<mlx::core::metal::make_task(mlx::core::array, bool)::$_1, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array, bool)::$_1>, void ()>::operator()[abi:... + 256 (function.h:193)
18  Test              	0x000000010479b4bc std::__1::__function::__func<mlx::core::metal::make_task(mlx::core::array, bool)::$_1, std::__1::allocator<mlx::core::metal::make_task(mlx::core::array, bool)::$_1>, void ()>::operator()() + 308 (function.h:364)
19  Test              	0x000000010482644c std::__1::__function::__value_func<void ()>::operator()[abi:ue170006]() const + 20 (function.h:518)
20  Test              	0x000000010482644c std::__1::function<void ()>::operator()() const + 20 (function.h:1169)
21  Test              	0x000000010482644c mlx::core::scheduler::StreamThread::thread_fn() + 196 (scheduler.h:54)
22  Test              	0x000000010482663c decltype(*std::declval<mlx::core::scheduler::StreamThread*>().*std::declval<void (mlx::core::scheduler::StreamThread::*)()>()()) std::__1::__invoke[abi:ue170006]<void (mlx::core::scheduler::StreamT... + 28 (invoke.h:308)
23  Test              	0x000000010482663c void std::__1::__thread_execute[abi:ue170006]<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::StreamThread::*)(), m... + 28 (thread.h:227)
24  Test              	0x000000010482663c void* std::__1::__thread_proxy[abi:ue170006]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, void (mlx::core::scheduler::Stream... + 72 (thread.h:238)
25  libsystem_pthread.dylib       	0x00000001f1aa506c _pthread_start + 136 (pthread.c:931)
26  libsystem_pthread.dylib       	0x00000001f1aa00d8 thread_start + 8 (:-1)

How to reproduce?

Don't know. This issue seems to occur only occasionally.
I can't provide meaningful code either.

@davidkoski
Collaborator

This is throwing an exception in the evaluation thread -- this should have printed a message in the crash log. Did you see anything like that?
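
Frames 3 through 8 of the trace show that mechanism: `__cxa_throw` finds no handler on the stream thread, so `std::terminate` runs and calls `abort`, which raises SIGABRT. As a minimal sketch of the same behavior outside MLX (the `run_on_worker` helper is hypothetical and is not MLX code), the following runs a task on a worker thread and captures the exception message instead of letting it reach `std::terminate`:

```cpp
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper (not part of MLX): runs `task` on a worker thread and
// returns the message of any std::exception it throws, or "" if it succeeds.
// Without the try/catch, an exception escaping the thread function would call
// std::terminate, which by default aborts the process (SIGABRT).
std::string run_on_worker(const std::function<void()>& task) {
  std::exception_ptr error;
  std::thread worker([&] {
    try {
      task();
    } catch (...) {
      error = std::current_exception();  // keep it for the calling thread
    }
  });
  worker.join();

  if (!error) {
    return "";
  }
  try {
    std::rethrow_exception(error);
  } catch (const std::exception& e) {
    return e.what();
  } catch (...) {
    return "unknown exception";
  }
}
```

Passing a task that throws `std::runtime_error` makes the helper return that error's `what()` text, whereas letting the same exception escape the thread function would abort the process with a trace like the one above.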

@tqtifnypmb
Author

> This is throwing an exception in the evaluation thread -- this should have printed a message in the crash log. Did you see anything like that?

This issue never occurred for me during development, so I did not see any printed message. For now I only have crash logs from iOS devices.

@davidkoski
Collaborator

Near the top of the crash log it may have some more specific information. Something along these lines:

Crashed Thread:        1  Dispatch queue: setup-queue

Exception Type:        EXC_CRASH (SIGABRT)
Exception Codes:       0x0000000000000000, 0x0000000000000000

Termination Reason:    Namespace SIGNAL, Code 6 Abort trap: 6
Terminating Process:   fileproviderd [12223]

Application Specific Information:
Assertion failed: (Unexpected function requested: existing (null), requested (null)), function xxx

@tqtifnypmb
Author

tqtifnypmb commented Aug 8, 2024

I found no Application Specific Information section among the logs I have.

However, I noticed that all the devices that crashed are quite old models (iPhone 10, iPhone 11, iPad 8, iPad 11). I'm not sure if the issue is related to their chips.

AppVariant:          1:iPad11,3:16
Code Type:           ARM-64 (Native)
Role:                Foreground
Parent Process:      launchd [1]

Date/Time:           2024-07-27 10:43:59.2120 +0200
Launch Time:         2024-07-27 10:43:45.7787 +0200
OS Version:          iPhone OS 17.5.1 (21F90)
Release Type:        User
Report Version:      104

Exception Type:  EXC_CRASH (SIGABRT)
Exception Codes: 0x0000000000000000, 0x0000000000000000
Termination Reason: SIGNAL 6 Abort trap: 6

Triggered by Thread:  12
Thread 12 crashed with ARM Thread State (64-bit):
    x0: 0x0000000000000000   x1: 0x0000000000000000   x2: 0x0000000000000000   x3: 0x0000000000000000
    x4: 0x00000001fb8feea6   x5: 0x000000000000000e   x6: 0x0000000000000020   x7: 0x0000000000000450
    x8: 0xa0a9d41e6bff9b04   x9: 0xa0a9d41f1bc16b04  x10: 0x0000000000001b00  x11: 0x00000001703edd60
   x12: 0x00000000000007fb  x13: 0x00000000000007fd  x14: 0x0000000000000010  x15: 0x0000000000000037
   x16: 0x0000000000000148  x17: 0x00000001703ef000  x18: 0x0000000000000000  x19: 0x0000000000000006
   x20: 0x000000000000c407  x21: 0x00000001703ef0e0  x22: 0x0000000205e678a0  x23: 0x000000010617e740
   x24: 0x0000000301c7b780  x25: 0x00000001703eea68  x26: 0x00000001703eec40  x27: 0x0000000300b0ed60
   x28: 0x7ffffffffffffff8   fp: 0x00000001703ee1a0   lr: 0x00000001f6df1c0c
    sp: 0x00000001703ee180   pc: 0x00000001e309342c cpsr: 0x40000000
   esr: 0x56000080  Address size fault
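
A note on reading the last line: the `esr` value does not necessarily indicate a memory fault. Going by the Arm architecture's ESR_EL1 layout (exception class in bits 31:26, instruction-length bit 25, syndrome in bits 24:0), `0x56000080` has exception class `0x15`, which denotes an SVC (system call) instruction executed in AArch64 state. That is consistent with the thread sitting in the `__pthread_kill` system call after `abort`. The "Address size fault" text appears to come from the crash reporter reading the low bits as a fault status code; that reading is an assumption, not something confirmed in this thread. A small sketch of the decoding (the struct and function names are illustrative):

```cpp
#include <cstdint>

// Fields of an AArch64 Exception Syndrome Register value.
struct EsrFields {
  std::uint32_t exception_class;  // EC, bits 31:26
  bool is_32bit_instruction;      // IL, bit 25
  std::uint32_t syndrome;         // ISS, bits 24:0
};

// Splits a raw ESR value into its three fields.
constexpr EsrFields decode_esr(std::uint32_t esr) {
  return EsrFields{(esr >> 26) & 0x3Fu, ((esr >> 25) & 0x1u) != 0,
                   esr & 0x01FFFFFFu};
}

// EC 0x15 is "SVC instruction execution in AArch64 state".
constexpr bool is_aarch64_svc(std::uint32_t esr) {
  return decode_esr(esr).exception_class == 0x15u;
}
```

For the value in this log, `decode_esr(0x56000080)` gives an exception class of `0x15` and a syndrome of `0x80`, so the register reflects the system call that delivered the abort rather than the original failure in `get_kernel_`.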

@davidkoski
Collaborator

Yes, that likely is an issue -- older GPUs didn't support the same Metal capabilities. I know mlx-swift works on an iPhone 12 Pro Max but I haven't tested on anything older.

There are some more details here: ml-explore/mlx-swift-examples#21
