Linux Security Modules (LSMs) vs Secure Computing Mode (seccomp)

image.png

You are a security conscious systems-engineer using a Linux-based operating system for your project. You’ve already taken a tour of Linux Security Modules (LSMs) and know how you might use them to increase the security of your system.

However, you may have also heard about Linux’s Secure Computing (seccomp) facilities. You may wonder how LSMs and seccomp compare to one another, why you cannot implement the features of seccomp as an LSM, and when you should use each. This post aims to provide some answers.

Seccomp and LSMs both result in the kernel constraining how a process interacts with the system, but with important differences. Namely, Secure Computing Mode, or seccomp, is about limiting the system calls a process can make. LSMs, in contrast, are about controlling access to objects in the kernel.

A Quick seccomp Primer

Andrea Arcangeli initially proposed seccomp in 2004 as a way for a process to transition itself to a kernel-enforced mode where only the readwriteexit, and sigreturn system calls are possible. Andrea used this capability as part of the CPUShare project. CPUShare intended to create a market for buying a selling unused CPU cycles and needed a method to limit what untrusted code could do. Sellers wouldn’t be too happy if buyers could easily launch a denial of service attack via a fork bomb. Andrea could have used an interpreter with a constrained runtime to limit untrusted code but, for efficiency and flexibility, he designed CPUShare so binary code could run directly on the hardware. CPUShare needed to constrain this untrusted binary code so it could only read input, make calculations, and write its results. It should permit nothing else. With seccomp, a management process could load an untrusted shared object file, open files for input and output, and enter seccomp mode before calling the untrusted shared object’s entry point. The constrained process now cannot open any new files, change directories, fork new processes, spawn threads, exec new programs, or anything except read and write to its open files and exit gracefully.

This strict form of seccomp provided strong isolation but was too limiting to be more generally useful. Needed was a way for processes to provide the policy for the kernel to use when deciding which system calls to allow. In 2012, 8 years after seccomp was first introduced, kernel developers merged seccomp filter modeallowing processes to define their own system call filtering policy.

Seccomp filter mode allows a process to install Berkeley Packet Filter (BPF) byte-code via the prctl system call. Once installed, this BPF program can prevent the calling process, or any descendants, from making any system calls.

For example, we see in the following snippet from the linux sources a BPF filter that kills a process when it makes the getpid system call:

  struct sock_filter filter[] = {
    BPF_STMT(BPF_LD|BPF_W|BPF_ABS,
            offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, __NR_getpid, 0, 1),
    BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_RET|BPF_K, SECCOMP_RET_ALLOW),
  };
  struct sock_fprog prog = {
    .len = (unsigned short)ARRAY_SIZE(filter),
    .filter = filter,
  };
  ...
  ret = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);

When installed for a task, seccomp filters run very early in the system call handling chain—after tracing but before dispatching through the system call table. Also worth noting is the seccomp_data structure shown below.

/**
 * struct seccomp_data - the format the BPF program executes over.
 * @nr: the system call number
 * @arch: indicates system call convention as an AUDIT_ARCH_* value
 *        as defined in <linux/audit.h>.
 * @instruction_pointer: at the time of the system call.
 * @args: up to 6 system call arguments always stored as 64-bit values
 *        regardless of the architecture.
 */
struct seccomp_data {
  int nr;
  __u32 arch;
  __u64 instruction_pointer;
  __u64 args[6];
};

BPF does not (currently) provide a mechanism to dereference pointers or otherwise examine the structures passed from user space. Only the immediate arguments, copied into the seccomp_data structure, are available. Therefore, BPF filters cannot use strings passed from user space to decide which system calls to allow.

Why can’t we just use LSMs?

LSMs and seccomp are both useful tools for increasing the security and safety of your systems. First, it is important to remember that an LSM, which implements Mandatory Access Control, protects the objects—files, inodes, task structures, interprocess communication structures—the kernel uses internally. It does this by allowing LSMs to attach context information to these objects, which it can check against a previously loaded policy.  A security administrator with the required permissions (typically root) must load and manage this policy. Unprivileged users normally cannot change the policy.

In contrast, seccomp allows an unprivileged process to constrain itself. In the same way an unprivileged process can drop capabilities, the seccomp modes allow an unprivileged process to drop the ability to make certain system calls. For example, the Unix cat utility opens and reads files specified on its command-line and writes their contents to its standard out. It does not need the ability to write any file descriptors besides stdout (2) and stderr (3). Therefore, the developers of catcould install a seccomp filter to kill itself if it calls write with any file descriptor besides 2 or 3.

Implementing similar functionality via LSMs is more involved since a processes’ standard out can refer to different kernel objects—device, file, pipe—and most LSMs do not provide a direct way to map these objects to file descriptors. Also complicating matters is the fact that there are competing LSMs that implement mandatory access control. SELinux, AppArmor, Tomoyo, and SMACK all have their own mechanisms for managing policy which would either require a developer to generate policy for each LSM, or constrain the user to only using a single one.

Alternatively, you could write your own custom LSM, but then you would need to build it directly into the kernel and devise a way to represent your specific policy, both which could be time consuming and fragile. Seccomp has the advantage here in that, if you have a properly configured kernel, it is available to use regardless of how a system may have their LSMs configured. From a binary perspective, there is only a single, well-understood way for a process to attach filters to itself or its descendants.

Finally, since the kernel checks seccomp filters so early in the system call handling sequence, they reduce the amount of code that an attacker can search for exploitable flaws. This reduces the kernels attack surface from these processes. LSMs, because they hook more deeply in the system call sequence, do not reduce the attack surface to the same extent that seccomp filters do. Specifically, the kernel runs the hooks after it has mapped system call parameters to internal objects, and this code may all contain flaws either now or after future modifications. Again, seccomp reduces this attack surface.

Why can’t we just use seccomp?

Seccomp is incapable of modeling the same policies LSMs are. For example, you cannot write a seccomp filter to prevent a user process from opening files in only certain locations in the filesystem, like /etc/password. Since seccomp filters cannot dereference pointers, they cannot compare the paths users pass as arguments to the open system call (like AppArmor) nor are they able to examine inodes to read security attributes attached to files (like SELinux).

Additionally, no mechanism exists to remove or modify a filter once attached to a process. You can only attach new filters to a process, and new filters cannot reverse the effects of existing filters. Therefore, it is challenging to create a set of filters that apply to the system globally. The closest you can get is to use systemd system call filtering or doing your work in containers which use seccomp filters (like docker).

Even with systemd system call filtering and docker seccomp policies, seccomp filters are best managed by application developers and not system administrators. Application developers should know better than administrators which system calls their applications need to operate. Also, developers should be in the best position to know which filters to update as their applications change.

Conclusion

Both LSMs and seccomp provide mechanisms to limit a processes’ interaction with a system. They aren’t tools to prevent your processes from being hacked, but they are both tools that can prevent an attacker from taking advantage of a flaw in one program to move to other areas of your system. Mandatory Access Control policies implemented via LSMs are the tool of choice when you seek to create a fine-grained security policy globally for the system. Seccomp filters are the tool of choice for an unprivileged process to drop its ability to make certain system calls, and can be an important component of more general process sandboxing techniques like Linux containers.