Whether you’re building a custom cluster or an entire platform like we are at Pantheon, you eventually need to consolidate multiple processes and daemons onto single virtual or physical machines. The reason might be cost, efficiency, or simply avoiding the latency that comes with a switched network hop. But once even two things run on the same box, isolating those things becomes a critical security and resource management issue.
Let’s start with security, since it’s a requirement — not just an optimization. The best isolation involves defense in depth; that is, knowing one security mechanism supposedly controls one kind of access shouldn’t stop you from locking down the access redundantly with other methods. Achieving defense in depth involves thought experiments where each of your isolation methods breaks down, sometimes in combination.
I like to start with some questions about each process. What happens if this process:
- is exploited by another process or an external agent through a flaw that allows arbitrary use of its normal capabilities?
- obtains read access to all the data on the system? (How about root? Can the information be used to exploit other systems or resources?)
- exhausts resources (CPU, memory, network bandwidth, block I/O)?
For the first question, we consider vulnerabilities like code and SQL injection that “infect” a process and use its capabilities for nefarious purposes. If multiple sites or databases coexist on a box, consider whether one database or one site going rogue poses a threat to the others. MySQL has file loading and writing commands that can often access anything the MySQL user can access. If PHP for all of the sites runs under the same user, could arbitrary PHP on one site read the settings.php file for another site and obtain database credentials?
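The settings.php question can be reasoned about with a small model of classic Unix permission checks. The following Python sketch is a simplification written for illustration: the function name and the uid/gid values are hypothetical, and it deliberately ignores root, ACLs, capabilities, MAC policies, and directory traversal permissions.

```python
from typing import Iterable


def can_read(file_uid: int, file_gid: int, file_mode: int,
             proc_uid: int, proc_gids: Iterable[int]) -> bool:
    """Approximate the classic owner/group/other read check.

    Simplified model: ignores root, ACLs, capabilities, MAC, and the
    permissions of the directories leading to the file.
    """
    if proc_uid == file_uid:
        # The owner class applies exclusively when the uids match.
        return bool(file_mode & 0o400)
    if file_gid in set(proc_gids):
        return bool(file_mode & 0o040)
    return bool(file_mode & 0o004)


# Hypothetical ids: site A's files are owned by uid/gid 1001 with mode 0600.
SITE_A_UID, SITE_A_GID, MODE = 1001, 1001, 0o600

# PHP for every site runs as the same user: site B's code can read site A's file.
shared_user_can_read = can_read(SITE_A_UID, SITE_A_GID, MODE, 1001, [1001])

# PHP for site B runs as its own user (uid/gid 1002): the read is refused.
isolated_user_can_read = can_read(SITE_A_UID, SITE_A_GID, MODE, 1002, [1002])
```

With a shared user, no file mode can separate one site from another, because the owner bits apply to both. That is the case the user isolation strategy addresses.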
Here are some mitigation strategies for having one process or external agent exploit the access of another:
- Use user isolation. Running each site in a separate PHP-FPM pool or with a separate Apache instance adds overhead but makes isolating a breach of one site far easier. Consider even running different MySQL databases under separate users. The extra memory overhead is surprisingly low.
- Consider namespace isolation. Modern Linux kernels generally support cgroups-integrated namespace controls for the network and filesystem. It’s now possible to run a process in a namespace that makes only the necessary files and network devices even appear to exist. Namespaces are never an alternative to user-based permissions, but they can augment them.
- Prefer Unix sockets to TCP/UDP ones. Not only are they faster and easier to track (because they’re named), they’re considerably easier to secure because normal user/group file permissions apply to them. nginx can connect to PHP-FPM via a Unix socket, and should unless they’re on separate boxes. PHP-FPM’s socket then only needs to be accessible to nginx.
- Consider mandatory access control (MAC). Tools like SELinux get a (sometimes deserved) bad rap because they are arcane and picky about which processes can do what. The key advantage that MAC provides is defense in depth by having access managed by an agent (usually the kernel) that won’t get compromised at the same time as the application. The classic example is the passwd utility, which uses setuid to operate as root to update /etc/shadow. If there’s no MAC and passwd gets compromised to do arbitrary things, the exploit is equivalent to root. A MAC approach generally limits passwd to only modifying /etc/shadow, regardless of what it tries to do.
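To make the Unix socket point concrete, here is a minimal Python sketch of a server that creates its listening socket with restrictive file permissions. The function name, the 0o660 mode, and the temporary path are assumptions for illustration; a real daemon such as PHP-FPM handles this through its own pool configuration rather than code like this.

```python
import os
import socket
import stat
import tempfile


def bind_restricted_unix_socket(path: str, mode: int = 0o660) -> socket.socket:
    """Bind a listening Unix socket reachable only by its owner and group."""
    if os.path.exists(path):
        os.unlink(path)  # remove a stale socket left by a previous run
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    # Tighten the umask so the socket file is never briefly world-accessible
    # between bind() and chmod().
    old_umask = os.umask(0o177)
    try:
        server.bind(path)
    finally:
        os.umask(old_umask)
    os.chmod(path, mode)
    server.listen(16)
    return server


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        sock_path = os.path.join(directory, "app.sock")
        listener = bind_restricted_unix_socket(sock_path)
        print(oct(stat.S_IMODE(os.stat(sock_path).st_mode)))
        listener.close()
```

Setting the socket file's group to one shared with the web server (for example with os.chown, which needs suitable privileges) would then let nginx connect while other local users are refused by the kernel.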
Escalation to global read or root access (the second question) is an essential concern for defense in depth beyond the machine itself. Consider whether any secrets or keys deployed to the machine — ones normally protected by other security measures — would make other systems or resources vulnerable:
- Amazon S3 secret keys may be on the system for remotely storing backups, but a rogue process obtaining access to such keys would also have access to delete the backups. You may want to have a separate agent elsewhere download and then transfer the file to S3, or connect to the box only briefly with the necessary credentials in tow. (Or, for ultimate geek points, do what we did with Pantheon; have a signing server for S3 requests that generates one-time-use signatures only good for uploading one backup.)
- Don’t leave unnecessary SSH private keys lingering on the system when SSH agent forwarding (-A) removes such a risk, at least while users are logged out.
- Manage allowed connections by IP address and port even within your clusters. A Varnish box, for example, has no business connecting to a MySQL box. Less trust within your clusters helps mitigate escalation opportunities.
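The signing-server idea from the S3 bullet can be sketched in a few lines of Python. This is not S3's actual request-signing scheme, nor Pantheon's implementation; it is a hypothetical model (all names, the secret, and the message layout are assumptions) of the principle that the box receives a narrow, single-use signature instead of the secret key itself.

```python
import hashlib
import hmac
import time
from typing import Optional, Set


def sign_upload(key: bytes, backup_name: str, expires_at: int) -> str:
    """Run on the signing server: authorize one PUT of one named backup."""
    message = f"PUT\n{backup_name}\n{expires_at}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class UploadVerifier:
    """Run on the storage side: accept each valid signature at most once."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._used: Set[str] = set()

    def accept(self, backup_name: str, expires_at: int, signature: str,
               now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        expected = sign_upload(self._key, backup_name, expires_at)
        if not hmac.compare_digest(expected, signature):
            return False  # wrong object, wrong expiry, or forged signature
        if now > expires_at:
            return False  # signature has expired
        if signature in self._used:
            return False  # replay: this signature was already spent
        self._used.add(signature)
        return True


# Hypothetical secret that never leaves the signing server and verifier.
SECRET = b"hypothetical-signing-secret"
verifier = UploadVerifier(SECRET)
token = sign_upload(SECRET, "site-a-backup.tar.gz", expires_at=2000)

first_use = verifier.accept("site-a-backup.tar.gz", 2000, token, now=1000)
replay = verifier.accept("site-a-backup.tar.gz", 2000, token, now=1001)
other_object = verifier.accept("site-b-backup.tar.gz", 2000, token, now=1001)
```

A rogue process that steals the token can at most upload that one backup before it expires. It cannot delete or read anything, which is the escalation the bullet warns about.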
Resource exhaustion (the third question) includes scenarios like the following:
- PHP using all the CPU (starving, say, the database process)
- The web server hitting disk for static files too hard (starving the database of block I/O)
- PHP processes competing with other PHP processes for CPU
- MySQL processes competing with other MySQL processes for block I/O
You may notice that, despite the resource exhaustion question being introduced in the context of security, these scenarios could occur from a denial of service attack or even just normal site operation. I’m intentionally glossing over the source of the resource exhaustion because it doesn’t matter whether it’s an attack or just poor application performance.
Historically, the practical isolation methods were separate physical or virtual machines (which carry substantial overhead), winner-take-all methods like setting “nice” levels, or un-burstable methods like processor core restriction. The modern Linux kernel has introduced much fairer controls with cgroups. These controls generally involve “shares” that get enforced when a resource approaches exhaustion.
For example, if one process has 10 CPU shares and another has 100, CPU contention results in the latter getting 10 times as much CPU time. The process with 10 shares still gets its fair portion, though. Shares managed by cgroups are burstable: when there’s no contention, every process group gets as much of the shared resource as it wants.
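The share arithmetic above can be written out as a short Python sketch. It models only the proportional split under full contention; the function and group names are made up for illustration, and the real scheduler's behavior has more nuance than a single division.

```python
from typing import Dict


def cpu_split_under_contention(shares: Dict[str, int]) -> Dict[str, float]:
    """Return each group's fraction of CPU time when every group wants more.

    Without contention the shares are not enforced, so this split applies
    only when the CPU is saturated.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one group needs a positive share")
    return {name: value / total for name, value in shares.items()}


# Hypothetical groups matching the 10-versus-100 example in the text.
split = cpu_split_under_contention({"php": 10, "mysql": 100})
for group, fraction in split.items():
    print(f"{group}: {fraction:.1%} of CPU time under contention")
```

The group with 10 shares still receives 10/110 of the CPU, roughly 9%, rather than being starved outright.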
There are a few approaches for adding cgroups to your systems:
- One is available basically everywhere: libcgroup. Despite the “lib” prefix, this package includes a classification daemon that can use criteria like the process uid to assign shares of resources. For example, processes running as the nginx user may get relatively few block I/O shares compared to MySQL. Or, libcgroup may assign various PHP-FPM process pools (which should be running as different users already) CPU shares, preventing one site from starving others of CPU.
- On systemd-based distributions, you can put the cgroups shares configuration directly in the .service unit configuration. We use this approach at Pantheon because we prefer a relatively flat, lightweight service and container management strategy. It’s a lot faster and leaner to start a few services with systemd than “boot” a container managed by LXC.
- Linux Containers (LXC) offer the heaviest approach without resorting to full or para-virtualization. Each container generally has its own package configuration, root filesystem, network namespace, and process namespace. LXC makes your services and containers fairly hierarchical; that is, you may even run Upstart or systemd to manage daemons within each container. The advantage of this hierarchy is the ability to manage cgroups restrictions container-by-container.
The best way to get started with these security and resource techniques is experimentation. Set up a basic instance of Ubuntu or Fedora (a recent kernel is critical) on the cloud provider of your choice. Then, check out the tools mentioned below.
Restricting Resources And Communication
- For CPU, I/O, and memory: cgroups (https://docs.fedoraproject.org/en-US/Fedora/17/html/Resource_Management_...), systemd options like "CPUShares" (http://www.freedesktop.org/software/systemd/man/systemd.exec.html), LXC
- For network communication: iptables, selinux, LXC, network namespacing in systemd
- For running a daemon as an isolated user: Upstart (recent releases), systemd (http://www.freedesktop.org/software/systemd/man/systemd.exec.html)
- For filesystem access isolation (beyond users and ACLs): filesystem namespacing in systemd, selinux, LXC, chroot (but be careful)
Viewing Current Resource Utilization And Communication
- For memory and CPU: top, htop (an interactive top with a process tree view), ps aux
- For network I/O: nethogs, netstat -nlp
- For local I/O: iostat
- For MySQL: innotop, mytop
- For Varnish: varnishtop, varnishhist, varnishstat
- For systemd users (CPU, memory, and I/O): systemd-cgtop
- For testing access as a different user (even if that user's shell is set to nologin): sudo -u
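When experimenting, it also helps to confirm which cgroups a process actually landed in. On Linux, /proc/<pid>/cgroup lists one line per hierarchy in the form hierarchy-ID:controller-list:cgroup-path. The Python sketch below parses that format; the type and function names are hypothetical, and the sample text is made up rather than captured from a real system.

```python
from typing import List, NamedTuple


class CgroupEntry(NamedTuple):
    hierarchy_id: int
    controllers: List[str]  # empty for the unified (cgroup v2) hierarchy
    path: str


def parse_proc_cgroup(text: str) -> List[CgroupEntry]:
    """Parse the contents of a /proc/<pid>/cgroup file."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only, since the path may contain colons.
        hierarchy_id, controllers, path = line.split(":", 2)
        names = [name for name in controllers.split(",") if name]
        entries.append(CgroupEntry(int(hierarchy_id), names, path))
    return entries


# Made-up sample covering a v1-style line and a v2-style line.
SAMPLE = "4:cpu,cpuacct:/system.slice/nginx.service\n0::/user.slice\n"
parsed = parse_proc_cgroup(SAMPLE)
```

On a Linux box, passing the contents of /proc/self/cgroup (or the file for any other pid you may read) shows whether a service ended up in the group you configured.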