Developers Club geek daily blog

2 years, 11 months ago
Lately I see quite a lot of people using container virtualization solely to lock a potentially unsafe application inside a container. As a rule they use Docker for this, because it is widespread and they do not know of anything better. Indeed, many daemons are initially started as root and then either drop their privileges, or the master process spawns worker processes with reduced privileges. There are also daemons that run only as root. If a vulnerability that gives access with maximum privileges is found in such a daemon, it will not be pleasant to discover attackers who have already downloaded all the data and left malware behind.

The containerization offered by Docker and similar software does solve this problem, but it also introduces new ones: you have to create a container for each daemon, take care of preserving changed files, and keep the base image up to date. Containers are also often based on different OSes, which have to be stored on disk even though you do not particularly need them. What should you do if you do not need containers as such, the application on Docker Hub is not built the way you need and its version is outdated, SELinux and AppArmor seem too complicated, and you would like to run the application in your own environment with the same isolation that Docker uses?


Capabilities

What distinguishes an ordinary user from root? Why can root manage the network, load kernel modules, mount file systems, and kill any user's processes, while an ordinary user cannot? It is all about capabilities, a mechanism for managing privileges. By default all of these privileges are granted to the user with UID 0 (i.e. root), and an ordinary user has none of them. Privileges can be both granted and taken away. For example, the familiar ping command needs to create a RAW socket, which an ordinary user cannot do. Historically ping carried the SUID flag, which simply ran the program as the superuser, but modern distributions now set the CAP_NET_RAW capability on it instead, which allows ping to be run from any account.

The file capabilities set on a file can be listed with the getcap command from libcap.

% getcap $(which ping)
/usr/bin/ping = cap_net_raw+ep

The p flag here means permitted, i.e. the application is allowed to use the given capability; e means effective, i.e. the application will actually use it; and there is also the i flag, inheritable, which makes it possible to keep the capability list across a call to execve().
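To make the flag notation concrete, here is a minimal Python sketch that splits one line of getcap output into the path, the capability names, and the flag letters. The function name parse_getcap_line is hypothetical, and the sketch assumes the single-clause "path = caps+flags" layout shown above; other getcap output layouts are not handled.

```python
# Meaning of the flag letters that follow the "+" sign.
FLAG_NAMES = {"e": "effective", "p": "permitted", "i": "inheritable"}


def parse_getcap_line(line):
    """Split 'path = cap_a,cap_b+ep' into (path, [caps], [flag names])."""
    path, _, clause = line.partition(" = ")
    caps, _, flags = clause.strip().partition("+")
    names = [c for c in caps.split(",") if c]
    return path.strip(), names, sorted(FLAG_NAMES[f] for f in flags)


path, caps, flags = parse_getcap_line("/usr/bin/ping = cap_net_raw+ep")
print(path, caps, flags)
```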

Capabilities can be set both at the file system level and on an individual thread of a program. A capability that was not available at launch cannot be obtained later, i.e. privileges can only be lowered, not raised.

There are also security bits (Secure Bits), three of them: KEEP_CAPS allows capabilities to be kept across a setuid call, NO_SETUID_FIXUP disables the adjustment of capabilities on a setuid call, and NOROOT prohibits granting additional privileges when suid programs are started.
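The per-thread capability sets are visible as hexadecimal bit masks in the CapInh, CapPrm, CapEff and CapBnd lines of /proc/<pid>/status. The following is a minimal sketch, not a replacement for the libcap tools, that decodes such a mask. The function names are hypothetical, and the table lists only a subset of the bit numbers from the kernel's linux/capability.h; unknown bits are reported by number.

```python
import os

# Subset of the bit numbers defined in linux/capability.h.
CAP_BITS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    6: "CAP_SETGID",
    7: "CAP_SETUID",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    13: "CAP_NET_RAW",
    16: "CAP_SYS_MODULE",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(mask_hex):
    """Turn a hex mask such as '0000000000002400' into capability names."""
    mask = int(mask_hex, 16)
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            names.append(CAP_BITS.get(bit, "bit_%d" % bit))
        bit += 1
    return names


def read_cap_sets(status_path="/proc/self/status"):
    """Read and decode the Cap* lines of a /proc status file (Linux only)."""
    sets = {}
    with open(status_path) as fh:
        for line in fh:
            key, _, value = line.partition(":")
            if key in ("CapInh", "CapPrm", "CapEff", "CapBnd"):
                sets[key] = decode_caps(value.strip())
    return sets


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    print(read_cap_sets())
```

Run inside a restricted service, read_cap_sets() gives a quick view of which capabilities the process really holds.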


Namespaces

The ability to place an application in its own namespaces is another feature of the Linux kernel. Separate namespaces can be set for:

  • File system
  • UTS (host name)
  • System V IPC (inter-process communication)
  • Network
  • PID
  • Users

If we place an application in, for example, a separate network namespace, it will not be able to see the network adapters that are visible from the host. The same can be done with the file system.
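One way to check which namespaces two processes share is to compare the symbolic links under /proc/<pid>/ns: processes in the same namespace see the same "type:[inode]" target. The sketch below assumes a Linux /proc layout; the function names are hypothetical, and reading the links of another user's process normally requires root.

```python
import os

# Namespace types matching the list above (mnt is the file system namespace).
NS_TYPES = ("mnt", "uts", "ipc", "net", "pid", "user")


def namespace_ids(pid, proc_root="/proc"):
    """Return {type: 'type:[inode]'} for the namespaces of one process."""
    ids = {}
    for ns in NS_TYPES:
        link = os.path.join(proc_root, str(pid), "ns", ns)
        try:
            ids[ns] = os.readlink(link)
        except OSError:
            ids[ns] = None  # not readable, or not supported on this system
    return ids


def differing_namespaces(pid_a, pid_b, proc_root="/proc"):
    """List the namespace types in which the two processes differ."""
    a = namespace_ids(pid_a, proc_root)
    b = namespace_ids(pid_b, proc_root)
    return [ns for ns in NS_TYPES if a[ns] != b[ns]]


print(namespace_ids(os.getpid()))
```

For a service started with a private network namespace, differing_namespaces(1, service_pid) would be expected to include "net".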


systemd

Fortunately, systemd supports everything needed to isolate applications and restrict their rights.

We will use these features, but first let us think a little about which rights our application needs.

So, what kinds of daemons are there? Some do not need superuser rights at all and use them only to listen on a port below 1024. For such programs it is enough to grant the CAP_NET_BIND_SERVICE capability, which lets them listen on any port without restrictions, and to start them as an unprivileged user right away. A capability can be set on a file with the setcap command. Our experimental "service" will be ncat from the nmap package, which will hand out shell access to anyone; it could hardly be worse:

% sudo setcap CAP_NET_BIND_SERVICE=ep /usr/bin/ncat
% getcap /usr/bin/ncat
/usr/bin/ncat = cap_net_bind_service+ep

Now we write the simplest possible systemd unit, which starts ncat with the necessary parameters on port 81 as the user nobody:


ExecStart=/usr/bin/ncat --exec /bin/bash -l 81 --keep-open --allow ::1
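Only the ExecStart= line of the unit is shown above. A minimal sketch of a complete unit file matching the description might look like this; the Description= text is a placeholder added for illustration, while User=nobody and the ExecStart= line come from the text:

```ini
[Unit]
# Placeholder description, not from the original article.
Description=Deliberately vulnerable ncat shell (illustrative)

[Service]
# Run as the unprivileged user named in the text.
User=nobody
ExecStart=/usr/bin/ncat --exec /bin/bash -l 81 --keep-open --allow ::1
```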

We save it as /etc/systemd/system/vuln.service and start it with the usual sudo systemctl start vuln.

We connect to it:

% ncat ::1 81

It works perfectly!

Now it is time to protect our service. For this systemd has the following directives:

  • CapabilityBoundingSet= manages capabilities. It keeps only the capabilities passed in this parameter or, conversely, removes the passed ones if the list is preceded by a tilde "~".
  • SecureBits= sets the security bits.
  • Capabilities= also manages capabilities, but capabilities set on the file at the file system level take precedence, so it is almost useless.
  • ReadWriteDirectories=, ReadOnlyDirectories=, InaccessibleDirectories= manage the file system namespace. They remount file systems in the daemon's namespace so that the given directories are available for reading and writing, read-only, or not at all (they appear empty).
  • PrivateTmp= gives the daemon its own private /tmp and /var/tmp inside the namespace.
  • PrivateDevices= takes away access to the devices in /dev, leaving only standard devices such as /dev/null, /dev/zero and /dev/random.
  • PrivateNetwork= creates an empty network namespace with a single lo interface.
  • ProtectSystem= mounts /usr and /boot read-only and, when given the argument "full", does the same with /etc.
  • ProtectHome= makes the directories /home, /root and /run/user inaccessible, or remounts them read-only when given the read-only parameter.
  • NoNewPrivileges= ensures that the application will not gain additional privileges. According to the authors, it is more powerful than the corresponding capability.
  • SystemCallFilter= filters system calls using the seccomp technology. More on it a bit later.

Let's rewrite our unit file using these options:


ExecStart=/usr/bin/ncat --exec /bin/bash -l 81 --keep-open --allow ::1
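Assembled from the directives listed above and the description that follows, a sketch of the hardened [Service] section could look like this. The exact set and order of directives is an assumption reconstructed from the text. Newer systemd releases name the directory directives InaccessiblePaths=, ReadOnlyPaths= and ReadWritePaths=.

```ini
[Service]
User=nobody
# The only capability the service may ever hold.
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
# Private /tmp and /var/tmp, no real devices, no home directories.
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
# /usr, /boot and /etc become read-only.
ProtectSystem=full
# A typical daemon has no business in /sys.
InaccessibleDirectories=/sys
ExecStart=/usr/bin/ncat --exec /bin/bash -l 81 --keep-open --allow ::1
```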

So, we gave our application a single capability, CAP_NET_BIND_SERVICE, created separate /tmp and /var/tmp, took away access to devices and home directories, remounted /usr, /boot and /etc read-only, and additionally blocked /sys, since a typical daemon is unlikely to go there. All of this runs as an ordinary user.

It should be noted that CapabilityBoundingSet does not let even suid applications such as su or sudo gain additional capabilities, so we will not be able to get access as another user or as root, even knowing their passwords, since the kernel will not allow the setuid and setgid calls:

% ncat ::1 81
python -c 'import pty; pty.spawn("/bin/bash")'   # creates a new pty; without it sudo or su cannot be used
[nobody@valaptop /]$ sudo -i    # setuid() and setgid() are denied
sudo: unable to change to root gid: Operation not permitted
sudo: unable to initialize policy plugin
[nobody@valaptop /]$ ping   # gaining the cap_net_raw capability is denied
bash: /usr/sbin/ping: Operation not permitted
[nobody@valaptop /]$ cd /home
bash: cd: /home: Permission denied
[nobody@valaptop /]$ ls -lad /home
d--------- 2 root root 40 Nov  3 11:46 /home
[nobody@valaptop tmp]$ ls -la /tmp
total 4
drwxrwxrwt  2 root root   40 Nov  5 00:31 .
drwxr-xr-x 19 root root 4096 Nov  3 22:28 ..

Let's consider the second type of daemon: those that start as root and then drop their privileges. This approach is used for many purposes: reading confidential files that only the superuser can access (for example, the private key a web server uses for TLS), writing logs that remain inaccessible if a non-root fork is compromised, and simply applications that change UID arbitrarily (ssh servers, ftp servers). If such programs are not isolated, something terrible can happen: the attacker gets full access as the superuser. Although the lack of the capabilities inherent to root makes it almost an ordinary unprivileged user, root still remains root, with many files belonging to it that it can read. So we additionally need to make sure that the directories where keys and configuration files that must not be read may be stored are inaccessible:


ExecStart=/usr/bin/ncat --exec /bin/bash -l 81 --keep-open --allow ::1
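A sketch of what the [Service] section for such a root-started daemon might look like, reconstructed from the description that follows. Several details are assumptions: that the earlier hardening directives are kept, that CAP_NET_BIND_SERVICE stays in the bounding set, and that /etc/openvpn (taken from the test session further down) stands in for the list of blocked directories.

```ini
[Service]
# No User= here: the daemon starts as root and drops privileges itself.
CapabilityBoundingSet=CAP_NET_BIND_SERVICE CAP_SETUID CAP_SETGID
NoNewPrivileges=true
PrivateTmp=true
PrivateDevices=true
ProtectHome=true
ProtectSystem=full
# Directories with keys and configuration the daemon must not read (example value).
InaccessibleDirectories=/sys /etc/openvpn
# Read-only /proc, so that sysctl values cannot be changed.
ReadOnlyDirectories=/proc
ExecStart=/usr/bin/ncat --exec /bin/bash -l 81 --keep-open --allow ::1
```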

Here we added the CAP_SETUID and CAP_SETGID capabilities so that our daemon can drop privileges, used NoNewPrivileges so that it cannot raise its own capabilities, blocked access to directories it should not read, and allowed only read access to /proc so that sysctl cannot be used. It is also possible to mount the whole root file system read-only at once and grant write access only to the directories the program uses.

The access rights to the /etc/shadow file should be checked separately. In modern distributions it is not readable even by root, and the CAP_DAC_OVERRIDE capability, which allows access rights to be ignored, is used to work with it.

% ls -la /etc/shadow
---------- 1 root root 1214 Nov  3 19:57 /etc/shadow

Let's check our settings!

python -c 'import pty; pty.spawn("/bin/bash")'   # creates a new pty
[root@valaptop /]# whoami
[root@valaptop /]# ping   # gaining the cap_net_raw capability is denied
bash: /usr/sbin/ping: Operation not permitted
[root@valaptop /]# cat /etc/shadow   # no CAP_DAC_OVERRIDE
cat: /etc/shadow: Permission denied
[root@valaptop /]# cd /etc/openvpn
bash: cd: /etc/openvpn: Permission denied
[root@valaptop /]# /suid   # SUID shell
[root@valaptop /]# cat /etc/shadow   # now from the new shell; no rights were gained
cat: /etc/shadow: Permission denied

Unfortunately, systemd cannot (yet) work with PID namespaces, so our root daemon can kill other programs running as root.

In general we could stop here: capabilities and namespace settings do a good job of isolating applications, but there is one more thing worth configuring.


seccomp

The seccomp technology forbids a program from making certain system calls and kills it immediately if it tries. Although seccomp appeared long ago, in 2005, it came into real use only fairly recently, with the release of Chrome 20, vsftpd 3.0 and OpenSSH 6.0.

There are two approaches to using seccomp: a blacklist and a whitelist. Compiling a blacklist of potentially dangerous calls is much simpler than compiling a whitelist, so this approach is used more often. By default the firejail project forbids programs from making the following syscalls (the tilde enables blacklist mode):

SystemCallFilter=~mount umount2 ptrace kexec_load open_by_handle_at init_module \
finit_module delete_module iopl ioperm swapon swapoff \
syslog process_vm_readv process_vm_writev \
sysfs _sysctl adjtimex clock_adjtime lookup_dcookie \
perf_event_open fanotify_init kcmp add_key request_key \
keyctl uselib acct modify_ldt pivot_root io_setup \
io_destroy io_getevents io_submit io_cancel \
remap_file_pages mbind get_mempolicy set_mempolicy \
migrate_pages move_pages vmsplice

In systemd up to and including version 227 there is a bug that requires setting NoNewPrivileges=true in order to use seccomp.

A whitelist can be compiled as follows:

  1. Run the required program under strace:
    % strace -qcf nginx

    We get a big table of syscalls:
    % time     seconds  usecs/call     calls    errors syscall
    ------ ----------- ----------- --------- --------- ----------------
      0.00    0.000000           0        24           read
      0.00    0.000000           0        27           open
      0.00    0.000000           0        32           close
      0.00    0.000000           0         6           stat
      0.00    0.000000           0         1           set_tid_address
      0.00    0.000000           0         4           epoll_ctl
      0.00    0.000000           0         3           set_robust_list
      0.00    0.000000           0         2           eventfd2

  2. Write them all down and set them as SystemCallFilter. Most likely your application will crash, since strace did not catch every call. Look in the audit daemon's logs to see which call the application died on:

    type=SECCOMP msg=audit(1446730375.597:7943724): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=11915 comm="(nginx)" exe="/usr/lib/systemd/systemd" sig=31 arch=40000003 syscall=191 compat=0 ip=0xb75e5be8 code=0x0
    The syscall number we need is 191. Open the syscall table and look up the name of the call by its number.
  3. Add it to the allowed calls. If the application crashes again, return to step 2.
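The bookkeeping in the steps above can be partly automated. The following Python sketch turns the summary table printed by strace -c into a SystemCallFilter= line and extracts the architecture and syscall number from a SECCOMP audit record. The function names are hypothetical, the parsing assumes the column layout and record format shown above, and the architecture table covers only two audit arch values; the number still has to be looked up in the syscall table for that architecture.

```python
import re

SAMPLE_TABLE = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.00    0.000000           0        24           read
  0.00    0.000000           0        27           open
  0.00    0.000000           0        32           close
"""


def syscalls_from_strace_summary(text):
    """Collect syscall names from the table printed by `strace -c`."""
    names = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        try:
            float(fields[0])  # data rows start with the "% time" number
        except ValueError:
            continue  # header or separator row
        name = fields[-1]
        if name != "total" and name not in names:
            names.append(name)
    return names


def system_call_filter(names):
    """Format the names as a whitelist directive for the unit file."""
    return "SystemCallFilter=" + " ".join(names)


# Audit arch values: machine type ORed with the little-endian/64-bit flags.
AUDIT_ARCH = {0x40000003: "i386", 0xC000003E: "x86_64"}


def parse_seccomp_record(record):
    """Return (architecture, syscall number) from a type=SECCOMP record."""
    fields = dict(re.findall(r"(\w+)=(\S+)", record))
    arch = int(fields["arch"], 16)
    return AUDIT_ARCH.get(arch, hex(arch)), int(fields["syscall"])


print(system_call_filter(syscalls_from_strace_summary(SAMPLE_TABLE)))
```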

Tips & Tricks

The captest command can be used to check the current privileges and whether they can be raised.

filecap lists the files that have capabilities set.

netcap lists running network programs that have at least one socket and at least one capability, and pscap shows all running programs with capabilities, not only network ones.

You do not have to edit the whole systemd unit and track its changes across updates; it is better to add the necessary directives through systemctl edit.
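As an illustration, running sudo systemctl edit vuln opens an editor and saves the result as a drop-in file next to the unit. A minimal sketch of such a drop-in follows; the choice of directives here is only an example.

```ini
# /etc/systemd/system/vuln.service.d/override.conf
[Service]
# Directives here are merged with the packaged unit file.
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=full
```

The drop-in survives package updates of the original unit, and only the directives it names are overridden.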

This article is a translation of the original post at