Using MSR_LSTAR to hook system calls in Linux (Intel 64-bit only)
-----------------------------------------------------------------

I recently had to implement a Linux kernel module that would - among other
things - hook particular system calls to execute a short snippet of code before
returning control to the actual implementation. Naturally, this seemed like a
typical jprobe job. However, the fine print [1] says:

    Probe handlers are run with preemption disabled. Depending on the
    architecture and optimization state, handlers may also run with interrupts
    disabled (e.g., kretprobe handlers and optimized kprobe handlers run
    without interrupt disabled on x86/x86-64). In any case, your handler
    should not yield the CPU (e.g., by attempting to acquire a semaphore).

And "attempting to acquire a semaphore" was of course one of the things that I
needed to do.

Searching the web led me to [2], a very nice blog post by Saad Talaat that
explains the different options:

    1) Patching the system call table
    2) Patching the Interrupt Descriptor Table (IDT)
    3) Patching MSR SYSENTER/SYSCALL

For reasons I have already forgotten, I didn't look at 1) and 2), but decided
to go for 3) right away. While the blog post provides a short example, I found
that it didn't work as expected (kernel panic). I decided to have a closer look
and came up with a solution that works, at least for me, and that I will now
explain in more detail.

Concept
-------

The idea is actually pretty straightforward. Whenever the syscall instruction
is executed, the processor stores RIP (the address of the next instruction) in
RCX and jumps to the address stored in the LSTAR Model Specific Register (MSR).
By changing the value of the LSTAR register, we can modify the kernel's system
call entry point.

It is interesting to see how the Linux kernel handles this. A search on
lxr.free-electrons.com leads to the arch/x86/kernel/cpu/common.c file [3]:

    void syscall_init(void)
    {
        ...
        wrmsrl(MSR_LSTAR, system_call);
        ...
    }

The passed system_call function is defined in arch/x86/kernel/entry_64.S:

    ENTRY(system_call)
        CFI_STARTPROC simple
        CFI_SIGNAL_FRAME
        CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET
        CFI_REGISTER rip,rcx
        /*CFI_REGISTER rflags,r11*/
        SWAPGS_UNSAFE_STACK
        /*
         * A hypervisor implementation might want to use a label
         * after the swapgs, so that it can do the swapgs
         * for the guest and jump here on syscall.
         */
    GLOBAL(system_call_after_swapgs)

        movq %rsp,PER_CPU_VAR(old_rsp)
        movq PER_CPU_VAR(kernel_stack),%rsp
        /*
         * No need to follow this irqs off/on section - it's straight
         * and short:
         */
        ENABLE_INTERRUPTS(CLBR_NONE)
        SAVE_ARGS 8, 0, rax_enosys=1
        ...
        call *sys_call_table(,%rax,8) # XXX: rip relative
        ...
    END(system_call)

Debugging
---------

After trying numerous blind variants of the code outlined at [2], I finally sat
down for a decent debugging session. I first wrote some code that would dump
the current bytes of the original system_call handler:

    void print_bytestream(char *name, void *pointer, int bytes)
    {
        int i;

        printk("%s @ %p:\n", name, pointer);
        for (i = 0; i < bytes; i++) {
            printk("%02x ", (unsigned char) *(unsigned char *)(pointer + i));
        }
        printk("\n");
    }

    ...

    void (*syscall_orig)(void) = NULL;
    uint64_t value;

    /* Read the original syscall entry points.
     */
    rdmsrl(MSR_LSTAR, value);
    syscall_orig = (void (*)(void)) value;

    /* Debug print the original system call handler */
    print_bytestream("syscall_orig", syscall_orig, 256);

On my current machine, this gives me in my syslog:

    syscall_orig @ ffffffff817318f0:
    Feb 4 01:41:57 i5 kernel: [10524.623674]
        0f 01 f8 65 48 89 24 25 00 b0 00 00 65 48 8b 24
        25 30 b8 00 00 fb 66 66 66 90 66 66 90 48 83 ec
        50 48 89 7c 24 40 48 89 74 24 38 48 89 54 24 30
        48 89 44 24 20 4c 89 44 24 18 4c 89 4c 24 10 4c
        89 54 24 08 4c 89 1c 24 48 89 44 24 48 48 89 4c
        24 50 f7 84 24 88 e0 ff ff d1 01 08 10 0f 85 4b
        01 00 00 25 ff ff ff bf 3d 20 02 00 00 0f 87 c2
        00 00 00 4c 89 d1 ff 14 c5 00 14 80 81 48 89 44
        24 20 bf ff fe 08 10 fa 66 66 66 90 66 66 90 8b
        94 24 88 e0 ff ff 21 fa 75 3b 48 8b 4c 24 50 4c
        8b 1c 24 4c 8b 54 24 08 4c 8b 4c 24 10 4c 8b 44
        24 18 48 8b 44 24 20 48 8b 54 24 30 48 8b 74 24
        38 48 8b 7c 24 40 65 48 8b 24 25 00 b0 00 00 0f
        01 f8 48 0f 07 0f ba e2 03 73 11 fb 66 66 66 90
        66 66 90 57 e8 57 3d ff ff 5f eb 9b fb 66 66 66
        90 66 66 90 0f ba e2 07 0f 82 8f 00 00 00 65 4c

Using the online disassembler ODA [4], I could figure out what was happening:

    0f01f8                  swapgs                           ; obtain a pointer to kernel data [5]
    654889242500b00000      mov %rsp,%gs:0xb000              ; old_rsp = rsp
    65488b242530b80000      mov %gs:0xb830,%rsp              ; rsp = kernel_stack
    fb                      sti                              ; enable interrupts
    66666690                data32 data32 xchg %ax,%ax       ; NOP
    666690                  data32 xchg %ax,%ax              ; NOP
    4883ec50                sub $0x50,%rsp                   ; make space to save registers
    48897c2440              mov %rdi,0x40(%rsp)              ; save rdi
    4889742438              mov %rsi,0x38(%rsp)              ; save rsi
    4889542430              mov %rdx,0x30(%rsp)              ; save rdx
    4889442420              mov %rax,0x20(%rsp)              ; save rax
    4c89442418              mov %r8,0x18(%rsp)               ; save r8
    4c894c2410              mov %r9,0x10(%rsp)               ; save r9
    4c89542408              mov %r10,0x8(%rsp)               ; save r10
    4c891c24                mov %r11,(%rsp)                  ; save r11
    4889442448              mov %rax,0x48(%rsp)              ; save rax
    48894c2450              mov %rcx,0x50(%rsp)              ; save rcx
    f7842488e0ffffd1010810  testl $0x100801d1,-0x1f78(%rsp)  ; should we do syscall tracing...
    0f854b010000            jne 0x000001ae                   ; ...if so, do it
    25ffffffbf              and $0xbfffffff,%eax             ; test if syscall number...
    3d20020000              cmp $0x220,%eax                  ; ...is within boundaries
    0f87c2000000            ja 0x00000135                    ; ...if not, go away
    4c89d1                  mov %r10,%rcx                    ; rcx = r10 [*1]
    ff14c500148081          callq *-0x7e7fec00(,%rax,8)      ; call *sys_call_table(,%rax,8)
    ...

*1. This mov is required to ensure the regular calling convention (where rcx
contains the fourth integer or pointer argument). Remember that the syscall
instruction caused rcx to contain rip. This also explains the "For system
calls, R10 is used instead of RCX" in [6].

With this information, we are now able to write a syscall dispatcher
replacement. A simple mimic of the original implementation was enough for me:

    .section .text
    .global syscall_new

    #define OLD_RSP      0xb000
    #define KERNEL_STACK 0xb830

    syscall_new:
        /* Note that this is simply copied from above, no fundamental
         * changes. */
        swapgs
        mov %rsp,%gs:OLD_RSP
        mov %gs:KERNEL_STACK,%rsp
        sti
        sub $0x50,%rsp
        mov %rdi,0x40(%rsp)
        mov %rsi,0x38(%rsp)
        mov %rdx,0x30(%rsp)
        mov %rax,0x20(%rsp)
        mov %r8,0x18(%rsp)
        mov %r9,0x10(%rsp)
        mov %r10,0x8(%rsp)
        mov %r11,(%rsp)
        mov %rax,0x48(%rsp)
        mov %rcx,0x50(%rsp)

        /* functionality goes here */

        /* This is just a reverse. If we had dumped more bytes earlier, we
         * would have seen similar code in the ODA disassembly. */
        cli
        mov 0x50(%rsp),%rcx
        mov (%rsp),%r11
        mov 0x8(%rsp),%r10
        mov 0x10(%rsp),%r9
        mov 0x18(%rsp),%r8
        mov 0x20(%rsp),%rax
        mov 0x30(%rsp),%rdx
        mov 0x38(%rsp),%rsi
        mov 0x40(%rsp),%rdi
        mov %gs:OLD_RSP,%rsp

        /* Jump back to GLOBAL(system_call_after_swapgs), as swapgs is
         * expensive. We could optimize (i.e., remove) the restore block above
         * by jumping to the location of the
         *     testl $0x100801d1, -0x1f78(%rsp)
         * instruction instead. */
        jmp *syscall_after_swapgs

Accompanied by just a bit of C code, we now have a handle for syscall
interception. All that is left is a couple of symbol lookups, storing the
original system call entry address and overwriting MSR_LSTAR accordingly.
When done, we restore MSR_LSTAR with the original address.

One last important aspect, however, is that MSR_LSTAR is (hyper)thread
specific. Since my processor has 4 CPU cores, I have to ensure that the LSTAR
MSR is updated on all of them. Forgetting this upon the overwrite is not that
big a deal: you'll only hook system calls that are performed by a single
processor core. Forgetting this when restoring, however, will result in an
awesome double fault race condition: if you restore LSTAR on a different core
than the one it was set on, then the original core still expects a syscall
handler at the new address. If the module is rmmod'ed, well, yeah :) (needless
to say, I was having this exact issue)

Implementation
--------------

Let's have a look at a very minimal kernel module that notifies you whenever
the SIGRETURN syscall is called:

File 1. constants.h
-------------------

    #ifndef SYSCALL_CONSTANTS
    #define SYSCALL_CONSTANTS

    /* These defines are used by both the C and assembly code. */
    #define OLD_RSP      0xb000
    #define KERNEL_STACK 0xb830

    #endif // SYSCALL_CONSTANTS

File 2. intercept.c
-------------------

    /* Minimal includes */
    #include <linux/module.h>   /* MODULE_*, module_init/module_exit, printk */
    #include <linux/kallsyms.h> /* kallsyms_lookup_name */
    #include <linux/delay.h>    /* msleep */
    #include <linux/smp.h>      /* on_each_cpu */
    #include <asm/msr.h>        /* rdmsrl, wrmsrl, MSR_LSTAR */

    #include "constants.h"

    MODULE_AUTHOR("Victor van der Veen");
    MODULE_DESCRIPTION("Intercept syscalls via MSR_LSTAR");
    MODULE_LICENSE("GPL");

    /* This global variable will hold the address of the original syscall entry */
    void (*syscall_orig)(void) = NULL;

    /* This global variable will hold the address of the system_call_after_swapgs
     * label.
     */
    void (*syscall_after_swapgs)(void);

    /* This symbol is defined in dispatcher.S */
    extern void syscall_new(void);

    /* This function will be executed right before the syscall */
    void do_c_sigreturn(void)
    {
        printk("SIGRETURN!\n");
    }

    /* Used by on_each_cpu() to write a value to all LSTAR registers */
    void update_lstar(void *addr)
    {
        wrmsrl(MSR_LSTAR, (unsigned long) addr);
    }

    /* Initialize system call interception */
    int intercept_syscalls_init(void)
    {
        uint64_t addr_kernel_stack, addr_old_rsp;
        uint64_t value;

        /* Make sure that OLD_RSP and KERNEL_STACK are defined properly */
        addr_old_rsp = kallsyms_lookup_name("old_rsp");
        addr_kernel_stack = kallsyms_lookup_name("kernel_stack");
        if (addr_old_rsp != OLD_RSP || addr_kernel_stack != KERNEL_STACK) {
            printk("Wrong values for OLD_RSP and/or KERNEL_STACK (was your kernel updated?)\n");
            printk("Please update constants.S:\n");
            printk("#define OLD_RSP %p\n", (void *) addr_old_rsp);
            printk("#define KERNEL_STACK %p\n", (void *) addr_kernel_stack);
            return -1;
        }

        /* Read the original syscall entry point, assuming it is the same for
         * all CPUs. */
        rdmsrl(MSR_LSTAR, value);
        syscall_orig = (void (*)(void)) value;

        /* Figure out where system_call_after_swapgs resides */
        syscall_after_swapgs = (void (*)(void)) kallsyms_lookup_name("system_call_after_swapgs");

        /* Overwrite the syscall entry point, all of them. The final 1 tells
         * on_each_cpu() to wait for update_lstar to finish. */
        on_each_cpu(update_lstar, (void *) syscall_new, 1);
        return 0;
    }

    /* Stop system call interception */
    void intercept_syscalls_exit(void)
    {
        if (syscall_orig == NULL)
            return;

        /* Restore the syscall entry point, all of them. */
        on_each_cpu(update_lstar, (void *) syscall_orig, 1);

        /* Give the system some time to adjust. If the module is being removed too
         * fast, code currently running in our system handler may cause a kernel
         * crash.
         */
        msleep(1000);
        syscall_orig = NULL;
    }

    /* Module entry/exit points */
    static int __init intercept_init(void)
    {
        if (intercept_syscalls_init() < 0) {
            printk("Failed to intercept system calls\n");
            return -1;
        }
        return 0;
    }

    static void __exit intercept_exit(void)
    {
        intercept_syscalls_exit();
    }

    module_init(intercept_init);
    module_exit(intercept_exit);

File 3. dispatcher.S
--------------------

    #include "constants.h"

    .section .text

    /* syscall_new will be used by intercept.c */
    .global syscall_new

    /* Two macros to SAVE or CONTINUE (restore) context */
    .macro SAVE
        swapgs
        mov %rsp,%gs:OLD_RSP
        mov %gs:KERNEL_STACK,%rsp
        sti
        sub $0x50,%rsp
        mov %rdi,0x40(%rsp)
        mov %rsi,0x38(%rsp)
        mov %rdx,0x30(%rsp)
        mov %rax,0x20(%rsp)
        mov %r8,0x18(%rsp)
        mov %r9,0x10(%rsp)
        mov %r10,0x8(%rsp)
        mov %r11,(%rsp)
        mov %rax,0x48(%rsp)
        mov %rcx,0x50(%rsp)
    .endm

    .macro CONTINUE
        cli
        mov 0x50(%rsp),%rcx
        mov (%rsp),%r11
        mov 0x8(%rsp),%r10
        mov 0x10(%rsp),%r9
        mov 0x18(%rsp),%r8
        mov 0x20(%rsp),%rax
        mov 0x30(%rsp),%rdx
        mov 0x38(%rsp),%rsi
        mov 0x40(%rsp),%rdi
        mov %gs:OLD_RSP,%rsp
        jmp *syscall_after_swapgs
    .endm

    /* This becomes the new entry point for syscall() */
    syscall_new:
        /* The syscall number is stored in %rax. syscall 0xf = sigreturn */
        cmp $0x0f, %rax
        je do_sigreturn

        /* Fallthrough to the original system call implementation, no need to do
         * saving and restoring here. */
        jmp *syscall_orig

    /* This is simply a stub that saves, calls our C function, and restores +
     * continues execution. */
    do_sigreturn:
        SAVE
        call do_c_sigreturn
        CONTINUE

File 4. Makefile
----------------

    obj-m += intercept-module.o
    intercept-module-objs = intercept.o dispatcher.o

    all:
    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

    clean:
    	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

______________________________________________________________________

Store all files in a directory and build using make.
Loading the module (with sudo insmod intercept-module.ko) will likely fail with
an error in syslog like:

    Feb 4 03:22:34 i5 kernel: [16556.980289] Wrong values for OLD_RSP and/or KERNEL_STACK (was your kernel updated?)
    Feb 4 03:22:34 i5 kernel: [16556.980293] Please update constants.S:
    Feb 4 03:22:34 i5 kernel: [16556.980295] #define OLD_RSP 000000000000b000
    Feb 4 03:22:34 i5 kernel: [16556.980297] #define KERNEL_STACK 000000000000b830
    Feb 4 03:22:34 i5 kernel: [16556.980298] Failed to intercept system calls

Copy/paste your values for OLD_RSP and KERNEL_STACK into constants.h (the log
message says constants.S, but the defines live in constants.h) and try again.
You should now see a

    Feb 4 03:24:58 i5 kernel: [16700.670548] SIGRETURN!

message in /var/log/syslog every time the sigreturn system call is executed,
which is quite often if you use an input device such as, ... a mouse :)

It should now be possible to do lots of awesome stuff (blocking on semaphores!)
that would not have been possible using jprobes. It is of course questionable
whether this is totally safe and robust...

Observe that, in contrast to system call table overwriting, the LSTAR method
also allows you to easily add new system calls without having to recompile the
entire kernel. To me, this method also feels like a relatively clean approach
compared to the others outlined in [2].

This should be all for now.

- vvdveen

[1] https://www.kernel.org/doc/Documentation/kprobes.txt
[2] https://ruinedsec.wordpress.com/2013/04/04/modifying-system-calls-dispatching-linux/
[3] http://lxr.free-electrons.com/source/arch/x86/kernel/cpu/common.c
[4] http://www2.onlinedisassembler.com/odaweb/IHjJsl/0
[5] http://www.x86-64.org/pipermail/discuss/2000-October/001009.html
[6] http://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI