9 Infrastructure

In this chapter, we take a step back from a single Apache server to discuss the infrastructure and the architecture of the system as a whole. Topics include:

We want to make each element of the infrastructure as secure as it can be, designing each to remain secure even if the others did not exist. We must do the following:

Some sections of this chapter (the ones on host security and network security) discuss issues that not only relate to Apache but could also be applied to running any service. I will mention them briefly so you know you need to take care of them. If you wish to explore these other issues, I recommend the following books:

Network Security Hacks is particularly useful because it is concise and allows you to find an answer quickly. If you need to do something, you look up the hack in the table of contents, and a couple of pages later you have the problem solved.

Choosing a correct application isolation strategy can have a significant effect on a project’s security. Ideally, a strategy will be selected early in the project’s life, as a joint decision of the administration and the development team. Delaying the decision may result in the inability to deploy certain configurations.

Isolating application modules from each other helps reduce damage caused by a break-in. The idea is not to put all your eggs into one basket. First, you need to determine whether there is room for isolation. When separating the application into individual logical modules, you need to determine whether there are modules that are accessed by only one class of user. Each module should be separated from the rest of the application to have its own:

This configuration will allow for maximal security and maximal configuration flexibility. If you cannot accommodate such separation initially, due to budget constraints, you should plan for it anyway and upgrade the system when the opportunity arises.

To argue the case for isolation, consider the situation where a company information system consists of the following modules:

Four groups of users each use their own application module and, more importantly, the modules carry four different levels of risk. The public application is the one carrying the largest risk. If you isolate application modules, a potential intrusion through the public portion of the application will not spill into the rest of the company (servers, databases, LDAP servers, etc.).

Here is the full range of solutions for isolation, given in the order of decreasing desirability from a security standpoint:

As previously mentioned, having many physical servers for security purposes can be costly. In between a full separate physical server solution and a chroot sits a third option: virtual servers.

Virtual servers are a software-based solution to the problem. Only one physical server exists, but it hosts many virtual servers. Each virtual server behaves like a less-powerful standalone server. There are many commercial options for virtual servers and two open source approaches:

Both solutions offer similar functionality, yet they take different paths to get there. User Mode Linux is a full emulation of a system: each virtual server runs its own kernel and has its own process list, memory allocation, etc. Virtual servers on a Linux VServer system share the same kernel, so isolation between them relies on heavy patching of that kernel.

Both solutions appear to be production ready. I have used User Mode Linux with good results. Many companies offer virtual-server hosting using one of these two solutions. The drawback is that both solutions require heavy kernel patching to make them work, and you will need to spend a lot of time to get them up and running. Note: User Mode Linux has been incorporated into the SUSE Enterprise Server family since Version 9.

On the plus side, consider the use of virtual servers in environments where hardware resources are limited but many projects require loose permissions on the server. Giving each project a virtual server would solve the problem without jeopardizing the security of the system as a whole.

Going backward from applications, host security is the first layer we encounter. Though we will continue to build additional defenses, the host must be secured as if no additional protection existed. (This is a recurring theme in this book.)

After the operating system installation, you will discover many active shell accounts in the /etc/passwd file. For example, each database engine comes with its own user account. Few of these accounts are needed. Review every active account and cancel the shell access of each account not needed for server operation. To do this, replace the shell specified for the user in /etc/passwd with /bin/false. For example, you would replace this line:

ivanr:x:506:506::/home/users/ivanr:/bin/bash

with:

ivanr:x:506:506::/home/users/ivanr:/bin/false
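
A quick sketch such as the following can help with the review; it lists every account whose shell has not been disabled yet (extend the list of harmless shells, such as /sbin/nologin, to match your system):

# list accounts that still have a working shell
awk -F: '$7 != "/bin/false" && $7 != "/sbin/nologin" { print $1, $7 }' /etc/passwd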

Restrict to whom you provide shell access. Users who are not security conscious represent a threat. Work to provide some other way for them to do their jobs without shell access. Most users only need a way to transport files and are quite happy using FTP for that. (Unfortunately, FTP sends credentials in plaintext, making it easy to break in.)

Finally, secure the entry point for interactive access by disabling insecure plaintext protocols such as Telnet, leaving only secure shell (SSH) as a means for host access. Configure SSH to refuse direct root logins, by setting PermitRootLogin to no in the sshd_config file. Otherwise, in an environment where the root password is shared among many administrators, you may not be able to tell who was logged on at a specific time.
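
The relevant line, usually found in /etc/ssh/sshd_config, is simply the following (restart the SSH daemon after changing it):

# refuse direct root logins; administrators log in with their own
# accounts and use su or sudo to become root
PermitRootLogin no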

If possible, do not allow users to use a mixture of plaintext (insecure) and encrypted (secure) services. For example, in the case of the FTP protocol, deploy Secure FTP (SFTP) where possible. If you absolutely must use a plaintext protocol and some of the users have shells, consider opening two accounts for each such user: one account for use with secure services and the other for use with insecure services. Interactive login should be forbidden for the latter; that way, a compromise of the account is less likely to lead to an attacker gaining a shell on the system.

Every open port on a host represents an entry point for an attacker. Closing as many ports as possible increases the security of a host. Operating systems often have many services enabled by default. Use the netstat tool on the command line to retrieve a complete listing of active TCP and UDP ports on the server:

# netstat -nlp
Proto Recv-Q Send-Q Local Address   Foreign Address   State      PID/Program name
tcp        0      0 0.0.0.0:3306    0.0.0.0:*         LISTEN     963/mysqld
tcp        0      0 0.0.0.0:110     0.0.0.0:*         LISTEN     834/xinetd
tcp        0      0 0.0.0.0:143     0.0.0.0:*         LISTEN     834/xinetd
tcp        0      0 0.0.0.0:80      0.0.0.0:*         LISTEN     13566/httpd
tcp        0      0 0.0.0.0:21      0.0.0.0:*         LISTEN     1060/proftpd
tcp        0      0 0.0.0.0:22      0.0.0.0:*         LISTEN     -
tcp        0      0 0.0.0.0:23      0.0.0.0:*         LISTEN     834/xinetd
tcp        0      0 0.0.0.0:25      0.0.0.0:*         LISTEN     979/sendmail
udp        0      0 0.0.0.0:514     0.0.0.0:*                    650/syslogd

Now that you know which services are running, turn off the ones you do not need. (You will probably want port 22 open so you can continue to access the server.) Turning services off permanently is a two-step process. First you need to turn the running instance off:

# /etc/init.d/proftpd stop

Then you need to prevent the service from starting the next time the server boots. The exact procedure depends on the operating system, but there are two places to look: on Unix systems, a service is either started at boot time (in which case it is permanently active) or started on demand, through the Internet services daemon (inetd or xinetd).
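
As a sketch (the exact commands differ from one distribution to another), here is how the two cases are handled on a Red Hat-style system:

# services started from init scripts are disabled with chkconfig
# (Debian-based systems use update-rc.d instead)
chkconfig proftpd off

# services started on demand are disabled in their xinetd
# configuration file, e.g., /etc/xinetd.d/telnet:
#     disable = yes
# followed by a reload of xinetd
/etc/init.d/xinetd reload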

Uninstall any software you do not need. For example, you will probably not need an X Window system on a web server, or the KDE, GNOME, and related programs.

Though desktop-related programs are mostly benign, you should uninstall some of the more dangerous tools such as compilers, network monitoring tools, and network assessment tools. In a properly run environment, a compiler on a host is not needed. Provided you standardize on an operating system, it is best to do development and compilation on a single development system and to copy the binaries (e.g., Apache) to the production systems from there.

It is important to gather the information you can use to monitor the system or to analyze events after an intrusion takes place.

Here are the types of information that should be gathered:

System statistics

Having detailed statistics of the behavior of the server is very important. In a complex network environment, a network management system (NMS) collects vital system statistics via the SNMP protocol, stores them, and acts when thresholds are reached. Having some form of an NMS is recommended even with smaller systems; if you can't justify such an activity, the sysstat package will probably serve the purpose. This package consists of several binaries executed by cron to probe system information at regular intervals, storing data in binary format. The sar binary is used to inspect the binary log and produce reports. Learn more about sar and its switches; the amount of data you can get out of it is incredible. (Hint: try the -A switch.)
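
For example, assuming the sysstat package is installed and its cron jobs are already collecting data:

# report everything collected so far today
sar -A
# watch CPU utilization live: 12 samples, 5 seconds apart
sar -u 5 12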

Integrity validation

Integrity validation software—also often referred to as host intrusion detection software—monitors files on the server and alerts the administrator (usually in the form of a daily or weekly report) whenever a change takes place. It is the only mechanism to detect a stealthy intruder. The most robust integrity validation software is Tripwire (http://www.tripwire.org). It uses public-key cryptography to prevent signature database tampering. Some integrity validation software is absolutely necessary for every server. Even a simple approach such as using the md5sum tool (which computes an MD5 hash for each file) will work, provided the resulting hashes are kept on a different computer or on a read-only media.
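
Here is what the md5sum approach might look like in its simplest form (the path is only an example; store the resulting file on another machine or on read-only media):

# create a baseline of the Apache installation
find /usr/local/apache -type f -exec md5sum {} \; > apache.md5

# later, report only the files that have changed
md5sum -c apache.md5 2>/dev/null | grep -v ': OK$'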

Process accounting

Process accounting enables you to log every command executed on a server (see Chapter 5).

Automatic log analysis

Except maybe in the first couple of days after installing your shiny new server, you will not review your logs manually. Therefore, you must find some other way to keep an eye on events. Logwatch (http://www.logwatch.org) looks at the log files and produces an activity report on a regular basis (e.g., once a day). It is a modular Perl script, and it comes preinstalled on Red Hat systems. It is great for summarizing what has been going on, and unusual events become easy to spot. If you want something that works in real time, try Swatch (http://swatch.sourceforge.net). Swatch and other log analysis programs are discussed in Chapter 8.

Though a network firewall is necessary for every network, individual hosts should have their own firewalls for the following reasons:

On Linux, a host-based firewall is configured through the Netfilter kernel module (http://www.netfilter.org). In the user space, the binary used to configure the firewall is iptables. As you will see, it pays off to spend some time learning how Netfilter works. On a BSD system, ipfw and ipfilter can be used to configure a host-based firewall. Windows server systems have a similar functionality but it is configured through a graphical user interface.

Whenever you design a firewall, follow the basic rules:

What follows is an example iptables firewall script for a dedicated server. It assumes the server occupies a single IP address (192.168.1.99), and the office occupies a fixed address range 192.168.2.0/24. It is easy to follow and to modify to suit other purposes. Your actual script should contain the IP addresses appropriate for your situation. For example, if you do not have a static IP address range in the office, you may need to keep the SSH port open to everyone; in that case, you do not need to define the address range in the script.

#!/bin/sh
   
IPT=/sbin/iptables
# IP address of this machine
ME=192.168.1.99
# IP range of the office network
OFFICE=192.168.2.0/24
   
# flush existing rules
$IPT -F
   
# accept traffic from this machine
$IPT -A INPUT -i lo -j ACCEPT
$IPT -A INPUT -s $ME -j ACCEPT
   
# allow access to the HTTP and HTTPS ports
$IPT -A INPUT -m state --state NEW -d $ME -p tcp --dport 80 -j ACCEPT
$IPT -A INPUT -m state --state NEW -d $ME -p tcp --dport 443 -j ACCEPT
   
# allow SSH access from the office only
$IPT -A INPUT -m state --state NEW -s $OFFICE -d $ME -p tcp --dport 22 -j ACCEPT
# To allow SSH access from anywhere, comment the line above and uncomment
# the line below if you don't have a static IP address range to use
# in the office
# $IPT -A INPUT -m state --state NEW -d $ME -p tcp --dport 22 -j ACCEPT
   
# allow related traffic
$IPT -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
   
# log and deny everything else
$IPT -A INPUT -j LOG
$IPT -A INPUT -j DROP

As you can see, installing a host firewall can be very easy to do, yet it provides excellent protection. You may also consider logging unrelated outgoing traffic: on a dedicated server, such traffic may be a sign of an intrusion. To use this technique, you need to be able to tell what constitutes normal outgoing traffic. For example, the server may have been configured to download operating system updates automatically from the vendor's web site. This is an example of normal (and required) outgoing traffic.
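
As a sketch, the following addition to the script above logs every new connection the server initiates itself; on a dedicated web server the list of legitimate destinations is short, so anything unexpected will stand out in the logs:

# log new outgoing connections (tune this to exclude known-good
# destinations, such as your vendor's update servers)
$IPT -A OUTPUT -m state --state NEW -j LOG --log-prefix "NEW OUTGOING: "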

For systems intended to be highly secure, you can make that final step and patch the kernel with one of the specialized hardening patches:

These patches will enhance the kernel in various ways. They can:

I mention grsecurity’s advanced kernel-auditing capabilities in Chapter 5.

Some operating systems have kernel-hardening features built into them by default. For example, Gentoo supports grsecurity as an option, while the Fedora developers prefer SELinux. Most systems do not have these features; if they are important to you, consider using one of the operating systems that support them. Such a decision will save you a lot of time. Otherwise, you will have to patch the kernel yourself. The biggest drawback of using a kernel patch is that you must start with a vanilla kernel, then patch and compile it every time you need to upgrade. If this is done without a clear security benefit, the kernel patches can be a great waste of time. Playing with mandatory access control, in particular, takes a lot of time and nerves to get right.

To learn more about kernel hardening, see the following:

  • “Minimizing Privileges” by David A. Wheeler (http://www-106.ibm.com/developerworks/linux/library/l-sppriv.html)

  • “Linux Kernel Hardening” by Taylor Merry (http://www.sans.org/rr/papers/32/1294.pdf)

Taking another step back from host security, we encounter network security. We will consider network design a little later. For the moment, I will discuss issues that need to be considered in this context:

A central firewall is mandatory. The remaining three steps are highly recommended but not strictly necessary.

As the number of servers grows, the ability to manually follow what is happening on each individual server decreases. The "standard" growth path for most administrators is to use host-based monitoring tools or scripts and rely on email messages for notification of unusual events. If you follow this path, you will soon discover you are getting too many emails and you still don't know what is happening and where.

Implementing a centralized logging system is one of the steps toward a solution for this problem. Having the logs at one location ensures you are seeing everything. As an additional benefit, centralization enhances the overall security of the system: if a single host on the network is breached, the attacker may attempt to modify the logs to hide her tracks. This is more difficult when logs are duplicated on a central log server. Here are my recommendations:

You will find that the syslog daemon installed by default on most distributions is not adequate for advanced configurations: it only offers UDP as a means of transport and does not offer flexible message routing. I recommend a modern syslog daemon such as syslog-ng (http://www.balabit.com/products/syslog_ng/). Here are its main advantages over the stock syslog daemon:

If you decide to implement central logging, that dedicated host can be used to introduce additional security to the system by implementing network monitoring or running an intrusion detection system. Intrusion detection is just another form of logging.

Network monitoring systems are passive tools whose purpose is to observe and record information. Here are two tools:

Argus is easy to install, easy to run, and produces very compact logs. I highly recommend that you install it, even if it runs on the same system as your main (and only) web server. For in-depth coverage of this subject, I recommend Richard Bejtlich’s book The Tao of Network Security Monitoring: Beyond Intrusion Detection (Addison-Wesley).

Intrusion detection system (IDS) software observes traffic and reacts to suspicious events. Many commercial and open source IDS tools are available. From the open source community, the following two are especially worth mentioning:

Snort is an example of a network intrusion detection system (NIDS) because it monitors the network. Prelude is a hybrid IDS; it monitors the network (potentially using Snort as a sensor), but it also supports events coming from other types of sensors. Using a hybrid IDS is a step toward a complete security solution.

The term intrusion prevention system (IPS) was coined to denote a system capable of detecting and preventing intrusion. An IPS can, therefore, offer better results, provided its detection mechanisms are reliable enough to avoid blocking legitimate traffic.

Since NIDSs are generic tools designed to monitor any network traffic, it is natural to attempt to use them for HTTP traffic as well. Though they work, the results are not completely satisfying:

These problems have led to the creation of specialized network appliances designed to work as HTTP firewalls. Because they are designed from the ground up with HTTP in mind, and have enough processing power, the two problems mentioned are neutralized. Several such systems are:

The terms web application firewall and application gateway are often used to describe systems that provide web application protection. Such systems are not necessarily embedded in hardware only. An alternative approach is to embed a software module into the web server and to protect web applications from there. This approach also solves the two problems mentioned earlier: there is no problem with SSL because the module acts after the SSL traffic is decrypted, and such modules typically operate on whole requests and responses, giving access to all the features of HTTP.

In the open source world, mod_security is an embeddable web application protection engine. It works as an Apache module. Installed together with mod_proxy and other supporting modules on a separate network device in the reverse proxy mode of operation, it creates an open source application gateway appliance. The setup of a reverse proxy will be covered in Section 9.4. Web intrusion detection and mod_security will be covered in Chapter 12.

A proxy is an intermediary communication device. The term “proxy” commonly refers to a forward proxy, which is a gateway device that fetches web traffic on behalf of client devices. We are more interested in the opposite type of proxy. Reverse proxies are gateway devices that isolate servers from the Web and accept traffic on their behalf.

There are two reasons to add a reverse proxy to the network: security and performance. The benefits coming from reverse proxies stem from the concept of centralization: by having a single point of entry for the HTTP traffic, we are increasing our monitoring and controlling capabilities. Therefore, the larger the network, the more benefits we will have. Here are the advantages:

There are some disadvantages as well:

The use of Apache 2 is recommended in reverse proxy systems. The new version of the mod_proxy module offers better support for standards and conforms to the HTTP/1.1 specification. The Apache 2 architecture introduces filters, which allow many modules to look at the content (both on the input and the output) simultaneously.

The following modules will be needed:

You are unlikely to need mod_proxy_connect, which is needed for forward proxy operation only.
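
As an illustration only (the exact list of modules depends on the features you use; the flags below cover the directives shown in this section), the proxy support can be compiled in statically like this:

# configure Apache 2 with the proxy and supporting modules
./configure \
    --enable-proxy \
    --enable-proxy-http \
    --enable-headers \
    --enable-rewrite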

Compile the web server as usual. Whenever the proxy module is used within a server, turn off the forward proxying operation:

# do not work as forward proxy
ProxyRequests Off

Not turning it off is a frequent error that creates an open proxy out of a web server, allowing anyone to go through it to reach any other system the web server can reach. Spammers will want to use it to send spam to the Internet, and attackers will use the open proxy to reach the internal network.

Two directives are needed to activate the proxy:

ProxyPass / http://web.internal.com/
ProxyPassReverse / http://web.internal.com/

The first directive instructs the proxy to forward all requests it receives to the internal server web.internal.com and to forward the responses back to the client. So, when someone types the proxy address in the browser, she will be served the content from the internal web server (web.internal.com) without having to know about it or access it directly.

The same applies to the internal server. It is not aware that all requests are executed through the proxy. To it the proxy is just another client. During normal operation, the internal server will use its real name (web.internal.com) in a response. If such a response goes to the client unmodified, the real name of the internal server will be revealed. The client will also try to use the real name for the subsequent requests, but that will probably fail because the internal name is hidden from the public and a firewall prevents access to the internal server.

This is where the second directive comes in. It instructs the proxy server to observe response headers, modify them to hide the internal information, and respond to its clients with responses that make sense to them.

Another way to use the reverse proxy is through mod_rewrite. The following would have the same effect as the ProxyPass directive above. Note the use of the P (proxy throughput) and L (last rewrite directive) flags.

RewriteRule ^(.+)$ http://web.internal.com/$1 [P,L]

At this point, one problem remains: applications often generate and embed absolute links into HTML pages. But unlike the response header problem, which is handled by Apache, absolute links in pages are left unmodified. Again, this reveals the real name of the internal server to its clients. This problem cannot be solved with standard Apache alone, but it can be solved with the help of a third-party module, mod_proxy_html, maintained by Nick Kew. It can be downloaded from http://apache.webthing.com/mod_proxy_html/. It requires libxml2, which can be found at http://xmlsoft.org. (Note: the author warns against using libxml2 versions lower than 2.5.10.)

To compile the module, I had to pass the compiler the path to libxml2:

# apxs -Wc,-I/usr/include/libxml2 -cia mod_proxy_html.c

For the same reason, in the httpd.conf configuration file, you have to load the libxml2 dynamic library before attempting to load the mod_proxy_html module:

LoadFile /usr/lib/libxml2.so
LoadModule proxy_html_module modules/mod_proxy_html.so

The module looks into every HTML page, searches for absolute links referencing the internal server, and replaces them with links referencing the proxy. To activate this behavior, add the following to the configuration file:

# activate mod_proxy_html
SetOutputFilter proxy-html
   
# prevent content compression in backend operation
RequestHeader unset Accept-Encoding
   
# replace references to the internal server
# with references to this proxy
ProxyHTMLURLMap http://web.internal.com/ /

You may be wondering about the directive to prevent compression. If the client supports content decompression, it will state that with an appropriate Accept-Encoding header:

Accept-Encoding: gzip,deflate

If that happens, the backend server will respond with a compressed response, but mod_proxy_html does not know how to handle compressed content and fails to do its job. By removing the header from the request, we force plaintext communication between the reverse proxy and the backend server. This is not a problem: chances are both servers share a fast local network, where compression would do little to improve performance.

Read Nick’s excellent article published in Apache Week, in which he gives more tips and tricks for reverse proxying:

“Running a Reverse Proxy With Apache” by Nick Kew (http://www.apacheweek.com/features/reverseproxies)

There is an unavoidable performance penalty when using mod_proxy_html. To avoid an unnecessary slowdown, activate this module only when a problem with absolute links needs to be solved.

A well-designed network is the basis for all other security efforts. Though we are dealing with Apache security here, our main subject alone is insufficient. Your goal is to implement a switched, modular network where services of different risk are isolated into different network segments.

Figure 9-1 illustrates a classic demilitarized zone (DMZ) network architecture.

This architecture assumes you have a collection of backend servers to protect and also assumes danger comes from one direction only, which is the Internet. A third zone, DMZ, is created to work as an intermediary between the danger outside and the assets inside.

Ideally, each service should be isolated onto its own server. When circumstances make this impossible (e.g., financial reasons), try not to combine services of different risk levels. For example, combining a public email server with an internal web server is a bad idea. If a service is not meant to be used directly from the outside, moving it to a separate server would allow you to move the service out of the DMZ and into the internal LAN.

For complex installations, it may be justifiable to create classes of users. For example, a typical business system will operate with:

With proper planning, each of these user classes can have its own DMZ, and each DMZ will have different privileges with regards to access to the internal LAN. Multiple DMZs allow different classes of users to access the system via different means. To participate in high-risk systems, partners may be required to access the network via a virtual private network (VPN).

To continue to refine the network design, there are four paths from here:

So far I have discussed the mechanics of reverse proxy operation. I am now going to describe usage patterns to illustrate how and why you might use the various types of reverse proxies on your network. Reverse proxies are among the most useful tools in HTTP network design. None of their benefits are HTTP-specific—it is just that HTTP is what we are interested in. Other protocols benefit from the same patterns I am about to describe.

The nature of patterns is to isolate one way of doing things. In real life, you may have all four patterns discussed below combined onto the same physical server.

For additional coverage of this topic, consider the following resources:

The configuration of an integration reverse proxy, illustrated in Figure 9-3, is similar to that of a front door pattern, but the purpose is completely different. The purpose of the integration reverse proxy is to integrate multiple application parts (often on different servers) into one unique application space. There are many reasons for doing this:

Basically, this pattern allows a messy configuration that no one wants to touch to be transformed into a well-organized, secured, and easy-to-maintain system.

There are two ways to use this pattern. The obvious way is to hide the internal workings of a system and present clients with a single server. But there is also a great benefit of having a special internal integration proxy to sort out the mess inside.
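
As a sketch (the internal host names below are made up), an integration reverse proxy is little more than a set of mappings from parts of the public URL space to the internal servers that implement them:

# never work as a forward proxy
ProxyRequests Off

# one part of the application lives on one internal server...
ProxyPass        /crm/  http://crm.internal.example.com/
ProxyPassReverse /crm/  http://crm.internal.example.com/

# ...and another part on a different one; clients see a single site
ProxyPass        /shop/ http://shop.internal.example.com/
ProxyPassReverse /shop/ http://shop.internal.example.com/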

In recent years there has been a lot of talk about web services. Systems are increasingly using port 80 and the HTTP protocol for internal communication as a new implementation of remote procedure calling (RPC). Technologies such as REST, XML-RPC, and SOAP (listed in ascending order of complexity) belong to this category.

Allowing internal systems to communicate directly results in a system where interaction is not controlled, logged, or monitored. The integration reverse proxy pattern brings order.

A protection reverse proxy, illustrated in Figure 9-4, greatly enhances the security of a system:

  • Internal servers are no longer exposed to the outside world. The pattern introduces another layer of protection for vulnerable web servers and operating systems.

  • Network topology remains hidden from the outside world.

  • Internal servers can be moved out of the demilitarized zone.

  • Vulnerable applications can be protected by putting an HTTP firewall on the reverse proxy.

The protection reverse proxy is useful when you must maintain an insecure, proprietary, or legacy system. Direct exposure to the outside world could lead to a compromise, but putting such systems behind a reverse proxy would extend their lifetime and allow secure operation. A protection reverse proxy can also actually be useful for all types of web applications since they can benefit from having an HTTP firewall in place, combined with full traffic logging for auditing purposes.

There are three reasons why you would concern yourself with advanced HTTP architectures:

It would be beneficial to define relevant terms first (this is where Wikipedia, http://www.wikipedia.org, becomes useful):

We will cover the advanced architectures as a journey from a single-server system to a scalable and highly available system. The application part of the system should be considered during the network design phase. There are too many application-dependent issues to leave them out of this phase. Consult the following for more information about application issues related to scalability and availability:

The following sections describe various advanced architectures.

At the bottom of the scale we have a single-server system. It is great if such a system works for you. Introducing scalability and increasing availability of a system involves hard work, and it is usually done under pressure and with (financial) constraints.

So, if you are having problems with that server, you should first look into ways to enhance the system without changing it too much:

If you have done all of this and you are still on the edge of the server’s capabilities, then look into replacing the server with a more powerful machine. This is an easy step because hardware continues to improve and drop in price.

The approach I have just described is not very scalable but is adequate for many installations that will never grow to require more than one machine. There remains a problem with availability—none of this will increase the availability of the system.

A cluster of servers (see Figure 9-7) provides scalability, high availability, and efficient resource utilization (load balancing). First, we need to create a cluster. An ideal cluster consists of N identical servers, called (cluster) nodes. Each node is capable of serving a request equally well. To create consistency at the storage level, one of the following strategies can be used:

  • Install nodes from a single image and automate maintenance afterward.

  • Boot nodes from the network. (Such nodes are referred to as diskless nodes.)

  • Use shared storage. (This can be a useful thing to do, but it can be expensive and it is a central point of failure.)

  • Replicate content (e.g., using rsync; a minimal sketch follows this list).

  • Put everything into a database (optionally clustering the database, too).
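
For the replication option, a minimal sketch (the node names and paths are made up) could be as simple as a cron job on the master node:

# push the document root from the master to every other node,
# removing files that no longer exist on the master
for node in www1 www2 www3 www4; do
    rsync -az --delete /var/www/ $node:/var/www/
done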

After creating a cluster, we need to distribute requests among cluster nodes. The simplest approach is to use a feature called DNS Round Robin (DNSRR). Each node is given a real IP address, and all IP addresses are associated with the same domain name. Before a client can make a request, it must resolve the domain name of the cluster to an IP address. The following query illustrates what happens during the resolution process. This query returns all IP addresses associated with the specified domain name:

$ dig www.cnn.com
   
; <<>> DiG 9.2.1 <<>> www.cnn.com
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 38792
;; flags: qr rd ra; QUERY: 1, ANSWER: 9, AUTHORITY: 4, ADDITIONAL: 4
   
;; QUESTION SECTION:
;www.cnn.com.                   IN      A
   
;; ANSWER SECTION:
www.cnn.com.            285     IN      CNAME   cnn.com.
cnn.com.                285     IN      A       64.236.16.20
cnn.com.                285     IN      A       64.236.16.52
cnn.com.                285     IN      A       64.236.16.84
cnn.com.                285     IN      A       64.236.16.116
cnn.com.                285     IN      A       64.236.24.4
cnn.com.                285     IN      A       64.236.24.12
cnn.com.                285     IN      A       64.236.24.20
cnn.com.                285     IN      A       64.236.24.28

Here you can see the domain name www.cnn.com resolves to eight different IP addresses. If you repeat the query several times, you will notice the order in which the IP addresses appear changes every time. Hence the name “round robin.” Similarly, during domain name resolution, each client gets a “random” IP address from the list. This leads to the total system load being distributed evenly across all cluster nodes.

But what happens when a cluster node fails? The clients working with the node have already resolved the name, and they will not repeat the process. For them, the site appears to be down though other nodes in the cluster are working.

One solution for this problem is to dynamically modify the list of IP addresses in short intervals, while simultaneously shortening the time-to-live (TTL, the period during which DNS query results are to be considered valid).
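
In BIND zone file terms, this is nothing more than several A records sharing one name, each with a short TTL (the addresses and TTL below are made up):

; each record carries a 60-second TTL, so clients re-resolve often
www   60   IN   A   192.168.0.101
www   60   IN   A   192.168.0.102
www   60   IN   A   192.168.0.103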

If you look at the results of the query for www.cnn.com, the TTL is set to 285 seconds. In fact, CNN domain name servers regenerate the list every five minutes. When a node fails, its IP address will not appear on the list until it recovers. In that case, a portion of the clients will experience downtime of a couple of minutes.

This process can be automated with the help of Lbnamed, a load-balancing name server written in Perl (http://www.stanford.edu/~schemers/docs/lbnamed/lbnamed.html).

Another solution is to keep the DNS static but implement a fault-tolerant cluster of nodes using Wackamole (http://www.backhand.org/wackamole/). Wackamole works in a peer-to-peer fashion and ensures that all IP addresses in a cluster remain active. When a node breaks down, Wackamole detects the event and instructs one of the remaining nodes to assume the lost IP address.

The DNSRR clustering architecture works quite well, especially when Wackamole is used. However, a serious drawback is that there is no place to put the central security reverse proxy to work as an application gateway.

A different approach to solving the DNSRR node failure problem is to introduce a central management node to the cluster (Figure 9-8). In this configuration, cluster nodes are given private addresses. The system as a whole has only one IP address, which is assigned to the management node. The management node will do the following:

  • Monitor cluster nodes for failure

  • Measure utilization of cluster nodes

  • Distribute incoming requests

To avoid a central point of failure, the management node itself is clustered, usually in a failover mode with an identical copy of itself (though you can use a DNSRR solution with an IP address for each management node).

This is a classic high-availability/load-balancing architecture. Distribution is often performed on the TCP/IP level so the cluster can work for any protocol, including HTTP (though all solutions offer various HTTP extensions). It is easy, well understood, and widely deployed. The management nodes are usually off-the-shelf products, often quite expensive but quite capable, too. These products include:

An open source alternative for Linux is the Linux Virtual Server project (http://www.linuxvirtualserver.org). It provides tools to create a high availability cluster (or management node) out of cheap commodity hardware.
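
As a sketch of what configuring such a management node with the Linux Virtual Server tools might look like (the addresses are made up, and ipvsadm options vary between versions):

# define a virtual HTTP service on the management node, using
# round-robin scheduling
ipvsadm -A -t 192.168.1.100:80 -s rr
# add two real servers behind it, using NAT forwarding
ipvsadm -a -t 192.168.1.100:80 -r 192.168.2.10:80 -m
ipvsadm -a -t 192.168.1.100:80 -r 192.168.2.11:80 -m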

Reverse proxy clusters are the same in principle as management node clusters except that they work on the HTTP level and, therefore, only for the HTTP protocol. This type of proxy is of great interest to us because it is the only architecture that allows HTTP firewalling. Commercial solutions that work as proxies are available, but here we will discuss an open source solution based around Apache.

Ralf S. Engelschall, the man behind mod_rewrite, was the first to describe how reverse proxy load balancing can be achieved using mod_rewrite:

“Website Balancing, Practical approaches to distributing HTTP traffic” by Ralf S. Engelschall (http://www.webtechniques.com/archives/1998/05/engelschall/)

First, write a script that will generate a list of available cluster nodes and store it in a file, servers.txt:

# a list of servers to load balance
www www1|www2|www3|www4

The script should be executed every few minutes to regenerate the list. Then configure mod_rewrite to use the list to redirect incoming requests through the internal proxy:

RewriteMap servers rnd:/usr/local/apache/conf/servers.txt
RewriteRule ^/(.+)$ http://${servers:www}/$1 [P,L]

In this configuration, mod_rewrite is smart enough to detect when the file servers.txt changes and to reload the list. You can configure mod_rewrite to start an external daemon script and communicate with it in real time (which would allow us to use a better algorithm for load distribution).
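
Here is a rough sketch of that approach: replace the rnd: map with a prg: map (for example, RewriteMap servers prg:/usr/local/apache/bin/pick_node.sh; the path and node names are made up) and have the external program answer every lookup with the node that should receive the request:

#!/bin/sh
# answer each RewriteMap lookup (one key per line on stdin) with the
# next node name, round-robin; a real program would also consider
# node health and current load
NODES="www1 www2 www3 www4"
i=0
while read key; do
    i=$(( (i % 4) + 1 ))
    echo "$NODES" | cut -d' ' -f$i
done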

With only a couple of additional lines added to the httpd.conf configuration file, we have created a reverse proxy. We can proceed to add features to it by adding other modules (mod_ssl, mod_deflate, mod_cache, mod_security) to the mix. The reverse proxy itself must be highly available, using one of the two methods we have described. Wackamole peer-to-peer clustering is a good choice because it allows the reverse proxy cluster to consist of any number of nodes.

An alternative to using mod_rewrite for load balancing, but only for the Apache 1.x branch, is to use mod_backhand (http://www.backhand.org/mod_backhand/). While load balancing in mod_rewrite is a hack, mod_backhand was specifically written with this purpose in mind.

This module does essentially the same thing as mod_rewrite, but it also automates the load balancing part. An instance of mod_backhand runs on every backend server and communicates with other mod_backhand instances. This allows the reverse proxy to make an educated judgment as to which of the backend servers should be handed the request to process. With mod_backhand, you can easily have a cluster of very different machines.

Only a few changes to the Apache configuration are required. To configure a mod_backhand instance to send status to other instances, add the following (replacing the specified IP addresses with ones suitable for your situation):

# the folder for interprocess communication
UnixSocketDir /usr/local/apache/backhand
# multicast data to the local network
MulticastStats 192.168.1.255:4445
# accept resource information from all hosts in the local network
AcceptStatus 192.168.1.0/24

To configure the reverse proxy to send requests to backend servers, you need to feed mod_backhand a list of candidacy functions. Candidacy functions process the server list in an attempt to determine which server is the best candidate for the job:

# byAge eliminates servers that have not
# reported in the last 20 seconds
Backhand byAge
# byLoad reorders the server list from the
# least loaded to the most loaded
Backhand byLoad

Finally, on the proxy, you can configure a handler to access the mod_backhand status page:

<Location /backhand/>
    SetHandler backhand-handler
</Location>