Failure to read memory in ARM vmcore captured by dump-capture kernel #461

Open
alecrivers opened this issue Jan 18, 2025 · 5 comments

@alecrivers

Hello,

First off, great project, thanks for it!

I've been debugging a nasty kernel oops, capturing vmcore files using a kexec'ed dump-capture kernel on the affected device. (I don't bother using makedumpfile to compress the cores.) I found that trying to get stack traces, e.g. prog.crashed_thread().stack_trace(), typically only showed the first stack frame, and then an empty frame at a meaningless address. The crash utility meanwhile was able to get a full stack trace, but I wanted drgn's ability to report local variables and structures.

Doing a bunch of debugging, I found that drgn's unwinder was doing the right thing in terms of looking in the right place for the next frame's FP. However, when it went to read that memory, it was getting the wrong value. I could check this by doing prog.read(<virtual memory address of the next FP>), which gave a different answer than asking crash to read the same address. Digging further, I found that the physical memory address translation was wrong. But I was surprised to find that doing prog.read(follow_phys(prog["init_mm"].address_of_(), <address>), 4, True) gave the correct answer.
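
The comparison described above can be sketched as a small helper. This is a minimal sketch, not code from drgn: `compare_reads` is a hypothetical name, and it takes `prog` and `follow_phys` as parameters so that it can be exercised without a core dump. In a real session one would presumably pass a `drgn.Program` and `drgn.helpers.linux.mm.follow_phys`; the 4-byte default size matches the read quoted above.

```python
def compare_reads(prog, follow_phys, address, size=4):
    """Read `size` bytes at a kernel virtual address two ways.

    Returns (via_virtual, via_physical, match).
    """
    # Path 1: let the program resolve the virtual address on its own
    # (for this vmcore, the issue reports that this used the PT_LOAD data).
    via_virtual = prog.read(address, size)

    # Path 2: translate through the kernel page tables first, then read
    # the resulting physical address (third argument True = physical read).
    mm = prog["init_mm"].address_of_()
    physical = follow_phys(mm, address)
    via_physical = prog.read(physical, size, True)

    return via_virtual, via_physical, via_virtual == via_physical
```

If `match` is False for an address where crash agrees with the second value, that points at the virtual mapping rather than at the unwinder.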

Looking deeper, I found that follow_phys() and crash were both referring to the page table to get their lookup data, whereas prog.read() was using the PT_LOAD data from the core dump. readelf gave:

Elf file type is CORE (Core file)
Entry point 0x0
There are 4 program headers, starting at offset 52

Program Headers:
  Type           Offset   VirtAddr   PhysAddr   FileSiz MemSiz  Flg Align
  NOTE           0x001000 0x00000000 0x00000000 0x00d94 0x00d94     0x4
  LOAD           0x002000 0x80000000 0x10000000 0x58000000 0x58000000 RWE 0
  LOAD           0x58002000 0xe8000000 0x78000000 0x7ff0000 0x7ff0000 RWE 0
  LOAD           0x5fff2000 0xf0000000 0x80000000 0x10000000 0x10000000 RWE 0

The virtual memory address in question was inside the last range.

I found that if I ignored that last memory section (simply by skipping it in drgn_program_set_core_dump_fd_internal()), everything started working. Huzzah!
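
To make the header arithmetic concrete, here is a rough sketch of how a reader that trusts the PT_LOAD headers would map a virtual address to a physical address and a file offset. It only illustrates the lookup and is not how drgn implements it internally. `LoadSegment`, `find_segment`, and `pt_load_translate` are hypothetical names, the segment values are copied from the readelf output above, and 0xF0001000 is a made-up address inside the last range.

```python
from typing import NamedTuple, Optional


class LoadSegment(NamedTuple):
    offset: int  # file offset of the segment
    vaddr: int   # VirtAddr
    paddr: int   # PhysAddr
    memsz: int   # MemSiz


# Values copied from the readelf output above.
SEGMENTS = [
    LoadSegment(0x00002000, 0x80000000, 0x10000000, 0x58000000),
    LoadSegment(0x58002000, 0xE8000000, 0x78000000, 0x07FF0000),
    LoadSegment(0x5FFF2000, 0xF0000000, 0x80000000, 0x10000000),
]


def find_segment(segments, vaddr) -> Optional[LoadSegment]:
    """Return the first segment whose virtual range contains vaddr."""
    for seg in segments:
        if seg.vaddr <= vaddr < seg.vaddr + seg.memsz:
            return seg
    return None


def pt_load_translate(segments, vaddr):
    """Return (physical address, file offset) implied by the headers, or None."""
    seg = find_segment(segments, vaddr)
    if seg is None:
        return None
    delta = vaddr - seg.vaddr
    return seg.paddr + delta, seg.offset + delta


# Made-up address inside the last LOAD range.
print([hex(x) for x in pt_load_translate(SEGMENTS, 0xF0001000)])
# → ['0x80001000', '0x5fff3000']

# With the last segment skipped, the headers no longer cover the address.
print(pt_load_translate(SEGMENTS[:-1], 0xF0001000))
# → None
```

Dropping the last entry mirrors the workaround: the address is no longer answered from the headers, so whatever the headers claimed for that range cannot affect the read.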

For now, because I'm on a tight deadline, I don't have time to investigate why this last segment may be incorrect. But I do note that the last segment is 256MB in size, which is exactly the amount of memory reserved by crashkernel for the dump-capture kernel. (I could try changing that size to see whether the segment changes too, but I haven't, and I know next to nothing about core dumps, so I can't say offhand whether it is or isn't related.)
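
The size observation can be checked with a few lines of arithmetic on the readelf values. This is only a sketch: `summarize` is a hypothetical helper, and it says nothing about why the segment exists.

```python
MIB = 1024 * 1024

# (VirtAddr, PhysAddr, MemSiz) for each LOAD header in the readelf output.
LOADS = [
    (0x80000000, 0x10000000, 0x58000000),
    (0xE8000000, 0x78000000, 0x07FF0000),
    (0xF0000000, 0x80000000, 0x10000000),
]


def summarize(loads):
    """Return (size in MiB, VirtAddr - PhysAddr) for each segment."""
    return [(memsz / MIB, vaddr - paddr) for vaddr, paddr, memsz in loads]


for size_mib, delta in summarize(LOADS):
    print(f"{size_mib} MiB, virt - phys = {delta:#x}")
```

The three segments come out at 1408, 127.9375, and 256 MiB, and all three use the same virtual-minus-physical offset of 0x70000000, so the last header describes its range with the same linear offset as the other two.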

Incidentally, while trying various things before this workaround, I also tried using libkdumpfile to read the core, but ran into the issue that this guard:

#ifdef WITH_KDUMPFILE

tests WITH_KDUMPFILE, while I think the correct macro would be WITH_LIBKDUMPFILE. However, changing that and using libkdumpfile didn't resolve my problem.

Thanks again for a great project.

@alecrivers
Author

Addendum: skipping that LOAD helped with the call stack of the crashing thread, but I still see the same issue with others. E.g.:

>>> stack_trace(find_task(1806))
#0  v7_dma_inv_range (/usr/src/kernel/arch/arm/mm/cache-v7.S:385)
>>> 

while crash yields

crash> set 1806       
    PID: 1806
COMMAND: "OA_Camera"
   TASK: 8ba268c0  [THREAD_INFO: 8ba268c0]
    CPU: 1
  STATE: TASK_RUNNING (ACTIVE)
crash> bt      
PID: 1806     TASK: 8ba268c0  CPU: 1    COMMAND: "OA_Camera"
 #0 [<8010f4ac>] (v7_dma_inv_range) from [<8010cd33>]
 #1 [<8010cd33>] (dma_cache_maint_page) from [<8010d35f>]
 #2 [<8010d35f>] (arch_sync_dma_for_device) from [<8017a6d7>]
 #3 [<8017a6d7>] (dma_direct_sync_sg_for_device) from [<80178f4b>]
 #4 [<80178f4b>] ($t) from [<7f8bb79f>]
 #5 [<7f8bb79f>] (__buf_prepare [videobuf2_common]) from [<7f8bcc6d>]
 #6 [<7f8bcc6d>] (vb2_core_qbuf [videobuf2_common]) from [<7f8ee429>]
 #7 [<7f8ee429>] (vb2_qbuf [videobuf2_v4l2]) from [<8057911d>]
 #8 [<8057911d>] (__video_do_ioctl) from [<80579ddb>]
 #9 [<80579ddb>] (video_usercopy) from [<8024c19b>]
#10 [<8024c19b>] (vfs_ioctl) from [<8024d03d>]
#11 [<8024d03d>] (sys_ioctl) from [<80100061>]

@brenns10
Contributor

brenns10 commented Jan 18, 2025

Hi @alecrivers, I've definitely encountered a similar situation in #217. In that case, it was a bug in /proc/kcore reading a certain kind of virtual memory segment. Just like your situation, it was possible to have drgn do the address translation to physical, and then read the correct value using the physical address. I wonder if it's possible that there's a bug hiding somewhere in the arch-specific code for /proc/vmcore for ARM? IIRC, arm32 didn't even have support for /proc/kcore until recently when @osandov submitted some patches for it (slated for 6.14).

It'll be interesting to see if we can reproduce this in an arm32 VM. If so, that'll make it much easier to get to the root cause. As a data point, what kernel version are you using here?

Finally, regarding libkdumpfile -- nice catch on WITH_KDUMPFILE. I think in that specific location of the code, it actually doesn't matter because libdrgn/python/main.c doesn't actually need the libkdumpfile headers. It's probably a vestige from some prior time when the header did get used. The only place it matters for the macro to be correctly spelled is near the bottom, where we define the _with_libkdumpfile module variable.

But to double-check: when you say you tested it with libkdumpfile, does that mean you set DRGN_USE_LIBKDUMPFILE_FOR_ELF=1 in the environment (docs)? If not, I think that's at least worth trying. I'm not certain, but I think libkdumpfile actually goes to the trouble of doing address translation rather than preferring to rely on the ELF virtual address segments. (My guess is that that's how crash does it too, judging by your example.)
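
For anyone scripting this test, a minimal sketch of launching drgn with the override set might look like the following. The helper names are hypothetical and the paths are placeholders. It assumes the drgn CLI's `-c` (core dump) and `-s` (symbols) options and that the variable is read from the process environment when the core is opened.

```python
import os


def drgn_env_with_libkdumpfile(base_env=None):
    """Return a copy of the environment with the override enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["DRGN_USE_LIBKDUMPFILE_FOR_ELF"] = "1"
    return env


def drgn_command(vmcore_path, vmlinux_path=None):
    """Build an argv list for the drgn CLI (paths are placeholders)."""
    argv = ["drgn", "-c", vmcore_path]
    if vmlinux_path is not None:
        argv += ["-s", vmlinux_path]
    return argv
```

These could then be passed to `subprocess.run(drgn_command("vmcore", "vmlinux"), env=drgn_env_with_libkdumpfile())`. Building the environment as a copy keeps the override from leaking into the calling process.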

osandov added a commit that referenced this issue Jan 21, 2025
Alec Rivers noticed in #461 that WITH_LIBKDUMPFILE is misspelled as
WITH_KDUMPFILE here. The whole ifdef block isn't actually needed, so
remove it.

Fixes: 4e330bb ("cli: indicate if drgn was compiled with libkdumpfile")
Signed-off-by: Omar Sandoval <[email protected]>
@osandov
Owner

osandov commented Jan 22, 2025

This does sound like a bug in ARM's /proc/vmcore. Unfortunately, I've never been able to get /proc/vmcore working in an ARM32 VM (see https://lore.kernel.org/linux-debuggers/ZvxT9EmYkyFuFBH9@telecaster/T/), so I can't test it.

Like @brenns10, I'd be curious to hear the results of testing with DRGN_USE_LIBKDUMPFILE_FOR_ELF=1. If that works, then we can probably add a workaround.

P.S. I just removed the incorrect WITH_KDUMPFILE, thanks for pointing that out.
P.P.S. Upstream ARM32 still doesn't have /proc/kcore, I still patch that in for testing drgn.

@brenns10
Contributor

> P.P.S. Upstream ARM32 still doesn't have /proc/kcore, I still patch that in for testing drgn.

Sorry, I confused that with your kcore performance improvements :)

@alecrivers
Author

Thanks for the replies, both. Things are a bit on fire, but I will report back on the DRGN_USE_LIBKDUMPFILE_FOR_ELF results when I get the time.
