
Testing DHCP failover with logging

A VMware case study

Hans Liss <Hans@Liss.pp.se> April 2009


Introduction

The hastily named yet carefully designed 'Gluff' system is based on an initial idea by Magnus Törnros, to solve the problem of getting a proper audit log of DHCP leases, complete with device and port info from the Option-82 data provided by many switches. More info on Gluff can be found at http://hans.liss.pp.se/software/gluff

In order to get this system working, I needed a test rig. It would need a number of client computers to produce DHCP leases as well as a number of servers, and it was obviously not feasible to set up a large number of physical computers. Furthermore, I didn't want to compromise by running several services in parallel on just a few computers when they would run on separate machines in the real setup. VMware solved this for me.

Using VMware Workstation, I was able to successfully set up a virtual test rig on a single Windows XP Home computer. The test rig consists of a gateway, two DHCP servers, a log server, a “port permuter” and six clients – all in all 11 different virtual computers.

This paper describes this setup in detail.


Overall setup

The real-life counterpart of this setup would include routers, a log server, two DHCP servers in a failover setup, and a large number of switches for client networks in a wide-area network. I needed to model the important parts of this setup, but I didn't need to replicate it exactly.

Although using more than one switch would have added to the realism, and provided more possible values for the Option-82 Agent ID, it would also have meant adding more network cards and several more physical and virtual networks. I decided against that added complexity, since getting different Agent IDs wouldn't have provided any significant benefit. The final setup is shown in the diagram in Appendix A.

Network plan

The network plan is pretty simple. Logically, there is a single client net, using 192.168.100.0/24. The gateway is directly connected to this network, so 192.168.100.1 is the default gateway on the clients. The Cisco switch, ClientSwitch, has the address 192.168.100.2, and the Autobridge server is available on 192.168.100.3.

There are three other segments connected to the gateway: Server net 1 (192.168.10.0/24) and Server net 2 (192.168.11.0/24) are used for the DHCP servers, while Server net 3 (192.168.15.0/24) is used for the Logger. The gateway is node 1 on all three nets, and the (single) server on each segment is node 10.

There is a total of eleven network segments in the setup, of which five are physical and the rest virtual.

VMware setup

I set the whole thing up using VMware Workstation's "team" system, which let me start the whole set of machines with a single command, with a predefined delay between guests, define a team-specific virtual network design, and get a decent overview of the network assignments within the team. The same thing can probably be done in VMware Server, even though the "team" concept isn't supported there.

Basic team configuration

The time delays between starting each virtual machine can be specified on the Virtual Machines tab.

Virtual networks

The virtual networks are configured in the Team Settings dialog, on the LAN Segments tab.

Physical networks

The physical networks are configured in the Virtual Network Editor, reached from the Edit menu. The main challenge here is to correctly identify each port, when they all look more or less alike. Note that there is an extra 21143-based card in this computer, which is not used in this particular setup.



The interface designations correspond to what you can see in the Control Panel's Network Connections view.



Connecting it all together

The actual network connections for each virtual machine can be configured in the machine settings, or on the Team Settings dialog's Connections tab.


Any connections to physical interfaces (using the already configured VMnet adapters) need to be configured in the virtual machine's own settings, however.

Console view

VMware Workstation's Console View enables easy access to each virtual machine within the team, as well as a live thumbnail gallery of all the machines. However, once all network connections work, it's obviously far better to use an SSH client to connect to the Gateway, and from there to the other machines.

Virtual machine configuration

Ubuntu Server 8.04 [i] was chosen as the operating system for all of the virtual computers, since it is a “mainstream” Linux distribution with a mature packaging system. Ubuntu Server is very easy to set up and doesn't have a very large footprint. However, it doesn't seem to run in less than 40MB of RAM, which can be a limiting factor in a setup like this.

The system on which the test rig is running has 2GB of memory, which is enough to run the whole rig pretty comfortably.

Eleven virtual computers were built by setting up the first one and then cloning it. Before the original server was cloned, a passphraseless SSH key was created for the root user and added to its own .ssh/authorized_keys file. This way, all the virtual computers can easily connect to each other.
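
Before cloning, the key setup on that original system amounted to something like the following (a minimal sketch; the key type and file names are assumptions, not the exact commands used):

# Run once as root on the original VM before cloning (key type and paths assumed)
mkdir -p /root/.ssh && chmod 700 /root/.ssh
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
cat /root/.ssh/id_rsa.pub >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys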

The six clients have nothing special installed on them, but the servers all have development packages (gcc, build-essential, etc.) installed, since I needed to compile dhcp and my own code on them.

Any servers that need to connect to outside systems using SSH have an extra SSH key with a passphrase. This is needed for CVS, among other things.

All the virtual computers have a root user with the password "root", to make it easy to log in on the VMware virtual console. However, they have no telnet service configured, and sshd does not accept password authentication.
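
The corresponding sshd settings would look something like this (an excerpt sketch of /etc/ssh/sshd_config, not the verbatim configuration):

# /etc/ssh/sshd_config (excerpt, sketch): allow root logins, but only with keys
PermitRootLogin yes
PasswordAuthentication no
ChallengeResponseAuthentication no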

The gateway

The gateway is the only system that has a bridged connection to the physical LAN, and it can act as a NAT gateway for external connections.

This system has five network connections:

  • eth0 is the uplink, bridged to the physical LAN

  • eth1 and eth2 are connected to the DHCP servers

  • eth3 is a separate segment for the Logger

  • eth4 is the internal “backbone” for the client network.
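
For the NAT function mentioned above, IPv4 forwarding also has to be enabled on the gateway, in addition to the masquerading rules added by domasq.sh (see Appendix C). A minimal sketch:

# Enable IPv4 forwarding on the gateway; the MASQUERADE rules are added by domasq.sh
echo 1 > /proc/sys/net/ipv4/ip_forward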

The gateway uses ISC's dhcrelay to relay DHCP requests from the internal backbone to the two DHCP servers. dhcrelay isn't very advanced; it actually needs to be told to listen on all of eth1, eth2 and eth4, because otherwise it simply ignores either the requests or the replies. dhcrelay is started from the local startup script, /etc/rc.local. By default this script does nothing in Ubuntu, but here it contains the following command:

/usr/local/sbin/dhcrelay -i eth1 -i eth2 -i eth4 -m forward 192.168.11.10 192.168.10.10

The DHCP servers

The two DHCP server systems are identical, apart from the DHCP configuration files. They both run an ISC DHCP server [ii], version 4.1.0, patched to provide Gluff with lease/release events in the form of a queue table in an sqlite3 database. They also both run the Gluff binary itself, which connects to the MySQL server on the Logger system.

Since the software being tested is not package-based and needs to be recompiled regularly, they both use an /etc/rc.local startup script for dhcpd and gluff:

/usr/local/sbin/dhcpd -cf /usr/local/etc/dhcpd.conf -ldb /var/db/dhcpd_queue.db3 eth0
/opt/gluff/bin/gluff -l /var/db/dhcpd_queue.db3 -h 192.168.15.10 -udhcpd -pfoobar -ddhcpd_leases -R

A peculiarity with ISC DHCP is that when the software is installed using "make install", it overwrites the existing configuration file. This means I need to keep a copy of these files and restore them after installing the software, each time I make a change.
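
In practice this means wrapping the install step in something like the following (a sketch; the backup location is arbitrary):

# Preserve the live dhcpd.conf across "make install" (backup path is arbitrary)
cp -p /usr/local/etc/dhcpd.conf /root/dhcpd.conf.saved
make install
cp -p /root/dhcpd.conf.saved /usr/local/etc/dhcpd.conf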

The two DHCP servers are configured in a failover setup. The primary has this in the configuration file:

failover peer "dhcp-failover" {
  primary;
  address 192.168.10.10;
  port 647;
  peer address 192.168.11.10;
  peer port 647;
  max-response-delay 10;
  max-unacked-updates 3;
  load balance max seconds 2;
  mclt 1800;
  split 128;
}

subnet 192.168.10.0 netmask 255.255.255.0 {
}

subnet 192.168.11.0 netmask 255.255.255.0 {
}

subnet 192.168.100.0 netmask 255.255.255.0 {
  pool {
    failover peer "dhcp-failover";
    range 192.168.100.50 192.168.100.60;
    option routers 192.168.100.1;
  }
}


The secondary server has a corresponding block in its configuration file:

failover peer "dhcp-failover" {
  secondary;
  address 192.168.11.10;
  port 647;
  peer address 192.168.10.10;
  peer port 647;
  max-response-delay 10;
  max-unacked-updates 3;
  load balance max seconds 2;
}

subnet 192.168.10.0 netmask 255.255.255.0 {
}

subnet 192.168.11.0 netmask 255.255.255.0 {
}

subnet 192.168.100.0 netmask 255.255.255.0 {
  pool {
    failover peer "dhcp-failover";
    range 192.168.100.50 192.168.100.60;
    option routers 192.168.100.1;
  }
}


DHCP lease times were set very short to trigger frequent lease updates for testing purposes.
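
In dhcpd.conf terms this amounts to something like the following (illustrative values, not necessarily the exact ones used):

# Very short lease times, for testing only (illustrative values)
default-lease-time 120;
max-lease-time 300;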


It's a good idea to set up NTP to keep the DHCP servers' clocks in sync. This also applies to the gateway (running dhcrelay) and to the Logger. Please note that VMware recommends either letting VMware Tools do the time synchronization or running NTP, but never both at the same time! [iii]
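
A minimal /etc/ntp.conf along these lines is enough (a sketch; the server names are just examples):

# /etc/ntp.conf (sketch); any reachable NTP servers will do
driftfile /var/lib/ntp/ntp.drift
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst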

The Logger

The logger is a simple Ubuntu server running a MySQL database accessible from the network. This is used by Gluff to store the DHCP lease audit log. There is nothing else on this system except for the command-line MySQL client, to be able to check the contents of the database. However, this server could also host an Apache/PHP system to build and test a management interface.
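
Setting that up amounts to letting mysqld listen on the network and granting the Gluff user access to the lease database. A sketch, using the database name and credentials visible on the gluff command line above (the host mask is an assumption):

# In /etc/mysql/my.cnf, change bind-address from 127.0.0.1 to the Logger's own
# address (or comment it out) and restart MySQL, then create the database and user:
mysql -u root -p <<'EOF'
CREATE DATABASE dhcpd_leases;
GRANT ALL PRIVILEGES ON dhcpd_leases.* TO 'dhcpd'@'192.168.%' IDENTIFIED BY 'foobar';
FLUSH PRIVILEGES;
EOF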

Option-82 switch information – the physical switch

One of the main points of the Gluff system was to record switch and switch port information along with the lease information. Option-82 [iv] can be used to add this information to the DHCP request payload, and ISC's dhcpd can use this information in several ways. For the test rig, I needed proper Option-82 info to be produced.

To produce the Option-82 information I needed, I first tried to use ISC's dhcrelay with its option for agent/circuit info, but I had little success in replicating the real case, which was based on Cisco switches bridging, rather than relaying, the DHCP requests. In the end, we decided that I needed to add a physical Cisco switch to the loop.

A Cisco 2950 series switch was configured and connected via four of its ports to a four-port network card in the computer, which provided an uplink and three downstream ports on which to connect client computers. Incidentally, this opened up the possibility of attaching more physical clients directly to the switch, if needed.

The switch was configured to add Option-82 information to forwarded DHCP requests [v]. The global commands for this are:

ip dhcp snooping
ip dhcp snooping vlan 1


Enabling DHCP snooping makes the switch refuse to forward DHCP requests, so in order to get forwarding working again, use the following command on the uplink interface to make the switch "trust" that interface for DHCP:

ip dhcp snooping trust
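
In context, the uplink interface configuration looks something like this (the interface name is an assumption, for illustration only):

! Uplink towards the gateway/relay (interface name assumed)
interface FastEthernet0/1
 ip dhcp snooping trust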

Autobridge, the port permuter

Six basic client computers were set up, and they were connected to internal, virtual networks in VMware Workstation. However, I needed to replicate the case where client computers get moved between different switch ports, since this is the main reason for auditing leases along with port information.

To this end, I set up a “port permuter”, Autobridge, to move the clients between the three physical ports on the switch. In VMware, I set up six virtual networks: three were connected to the three downstream switch ports and to Autobridge, and the other three were connected between Autobridge and the clients. Note that I could easily have set up any number of downstream ports; it just happened to be three.

Autobridge is configured to connect its six interfaces with three bridges, mapping the upstream ports randomly to downstream ports. It rebuilds this mapping regularly, in effect moving the clients to different ports every time and forcing Gluff to produce a new lease entry in the database. The script used for this is included in Appendix B; it is run by cron every 20 minutes.
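
The cron entry itself is essentially just the following (a sketch; the script path and log file are assumptions):

# /etc/cron.d/autobridge (sketch): re-randomize the port mapping every 20 minutes
*/20 * * * * root /usr/local/bin/setupbridge.sh >> /var/log/setupbridge.log 2>&1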

The Clients

Six clients were built, and they are all very basic Ubuntu installs. All they need to do is sit there, get a lease through DHCP and renew that lease when necessary. They also have scripts for restarting/reloading the network configuration, initiated by the Gateway.

Management by SSH

With a large number of hosts to manage, it's very useful to be able to do oft-repeated operations from a central point. In the old days, this was done using the /etc/hosts.equiv file and the rsh command, but the modern, much more secure, alternative is using SSH. OpenSSH permits passphraseless keys that make it possible to use SSH connections to other systems extensively in scripts.

Be very careful when using this trick, though. Make sure you protect keys like this very carefully, try to restrict SSH access so only known hosts can connect, and never use these keys where you don't need them.

In a real-life situation, you could use the same method if you designate a particularly secure system as the management host, make sure access to this host is severely restricted, and never use the single passphraseless key from this system anywhere other than on the managed hosts.

In the virtual test rig, these considerations aren't anywhere near as important. I initially set up all the virtual machines to accept connections from any of the other systems, using the same passphraseless key everywhere. In reality, all of the connections, with few exceptions, are initiated on the Gateway.

The servers' IP addresses are fixed, of course. They are listed in a text file called hosts.txt, for convenience.
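
Its contents are simply one fixed server address per line, probably something like this (a sketch; the exact list depends on which systems the scripts should reach):

192.168.10.10
192.168.11.10
192.168.15.10
192.168.100.3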

Accessing all the clients is a bit more tricky since they keep changing their IP addresses. I made a set of scripts that enable the Gateway to connect to the two DHCP servers and extract lease information from their DHCP lease databases, /var/db/dhcpd.leases. Refer to Appendix C, listleases.sh (1) and (2).

I also needed a simple way to safely take the whole team down, and because of the network design, this has to be done in a certain order. Refer to Appendix C, team_stop.sh.

Any time I need to restart the lease process from scratch, I have to reset all the clients' interfaces, stop dhcpd and gluff on the DHCP servers, remove the lease databases and then restart everything. To this end, I built a set of scripts that could initiate a delayed network restart on the clients, and restart the DHCP servers from scratch. Refer to Appendix C, clients_nwrestart_slow.sh and dhcp_reset.sh for details.

Running tests

The test rig has been used for regression tests of the Gluff system, including the patched ISC DHCP server software. No formal test protocol has been used, but the end result of the setup, in the form of data in the MySQL database, is a simple indication of the functionality. By running long-term tests with short lease times and the Autobridge constantly rearranging the clients, a strong indication of stability could be acquired. Monitoring memory consumption will also reveal most memory leaks when doing long-term automated tests.
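
Memory monitoring can be as simple as a periodic ps snapshot on each DHCP server (a sketch; the log file name is an assumption):

# Log resident memory (RSS, in kB) of dhcpd and gluff; run this from cron
(date; ps -o pid,rss,comm -C dhcpd,gluff) >> /var/log/memwatch.log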

Adding a formal test protocol to this would be perfectly feasible, and it could even be set up to do automatic regression testing, by accumulating the lease data through some kind of secondary channel (e.g. scripts collating log data from the syslog to produce the same information) and then automatically comparing the two.

Conclusions

Setting up a test rig, even for relatively complex setups, has never been easier. Using VMware for this kind of thing not only saves on hardware, noise, space and potential hardware-related problems, it also allows you to quickly set up and clone the computers without having to do full installs on physical computers. In short, it allows you to concentrate on the important issues.

This particular example demonstrates a configuration where a physical device is included in the test rig, even though the majority of test systems are virtual. It also demonstrates a complex test being done on a mainstream workstation running a home operating system.

A setup like this can easily be kept running for weeks without taking up unnecessary space, producing noise or consuming significant amounts of power.

The entire "DHCP Test" VMware team takes up just over 10GB of disk space, so it can easily be archived or moved around, even on a modern USB stick, whenever needed.

Apart from the need for quite a lot of RAM and the network interface identification issues, no serious VMware-related problems were encountered, and VMware did its job admirably.

My conclusion is that VMware can be of immense help when a complex test setup is needed. The only situations where VMware wouldn't be optimal for testing are when speed or timing are important considerations, or when special hardware is used which can't be safely used with virtual machines.


References

[i] http://www.ubuntu.com/
[ii] https://www.isc.org/software/dhcp
[iii] http://www.vmware.com/pdf/vmware_timekeeping.pdf
[iv] RFC 3046: http://www.faqs.org/rfcs/rfc3046.html
[v] http://www.cisco.com/en/US/docs/switches/lan/catalyst4500/12.1/13ew/configuration/guide/dhcp.html

Appendix A

Overall structure and IP addressing

 

Virtual Test Rig overview


Appendix B

"setupbridge.sh", the Autobridge script

 

#!/bin/sh

# Set DO=echo for a dry run; leave it empty to actually execute the commands.
DO=

export PATH=$PATH:/sbin:/usr/sbin

TEMPFILE1=`mktemp`
TEMPFILE2=`mktemp`

echo "Setupbridge `date`"

# Take the existing bridges down.
$DO ifconfig bridge01 down
$DO ifconfig bridge02 down
$DO ifconfig bridge03 down

# Detach every ethN interface from its bridge and bring it down.
brctl show | grep "eth" | awk '$1 ~ /bridge/ {brname=$1} {print brname " " $NF}' | while read bridge if; do
  $DO brctl delif $bridge $if
  $DO ifconfig $if down
done

# Remove the bridges.
$DO brctl delbr bridge01
$DO brctl delbr bridge02
$DO brctl delbr bridge03

# One fixed interface per bridge...
cat > $TEMPFILE1 << EOF
bridge01 eth0
bridge02 eth1
bridge03 eth2
EOF

# ...and the remaining three interfaces, in random order.
shuf > $TEMPFILE2 << EOF
eth3
eth4
eth5
EOF

# Recreate each bridge with its fixed interface and one randomly chosen interface.
paste -d" " $TEMPFILE1 $TEMPFILE2 | while read bridge if1 if2; do
  $DO ifconfig $if1 down
  $DO ifconfig $if2 down
  $DO brctl addbr $bridge
  $DO brctl addif $bridge $if1
  $DO brctl addif $bridge $if2
  $DO ifconfig $if1 0.0.0.0
  $DO ifconfig $if2 0.0.0.0
done

# bridge01 also carries Autobridge's own management address.
$DO ifconfig bridge01 up 192.168.100.3 netmask 255.255.255.0 broadcast 192.168.100.255
$DO ifconfig bridge02 up
$DO ifconfig bridge03 up
$DO route add default gw 192.168.100.1

rm $TEMPFILE1 $TEMPFILE2



Appendix C

Useful scripts

domasq.sh

Gateway script: Set up masquerading (NAT) on the gateway.


#!/bin/sh

for net in 192.168.10.0/24 192.168.11.0/24 192.168.15.0/24 192.168.100.0/24; do
  iptables -t nat -A POSTROUTING -o eth0 --source $net -j MASQUERADE
done

slowrestart.sh

Client script: Restart networking, with a five-minute delay.


#!/bin/sh

sleep 5
ifdown eth0
killall dhclient3
> /var/lib/dhcp3/dhclient.leases
> /var/lib/dhcp3/dhclient.eth0.leases
sleep 300
/etc/init.d/networking start

runrestart.sh

Client script: Run slowrestart.sh as a background job.


#!/bin/sh

nohup /usr/local/bin/slowrestart.sh < /dev/null > /dev/null 2>&1 &
exit 0

clients_nwrestart_slow.sh

Gateway script: Connect to all clients and initiate a slow network restart on each.


#!/bin/sh

clients=`/usr/local/bin/listleases.sh`

for a in $clients
do
  echo Restarting networking on $a
  ssh -oStrictHostKeyChecking=no $a '/usr/local/bin/runrestart.sh'
done

echo Done.

listleases.sh (1)

Gateway script: Connect to DHCP servers and enumerate active leases.


#!/bin/sh

TMPFILE=`mktemp`
ssh 192.168.10.10 /usr/local/bin/listleases.sh > $TMPFILE
ssh 192.168.11.10 /usr/local/bin/listleases.sh >> $TMPFILE
sort -u $TMPFILE
rm $TMPFILE


listleases.sh (2)

DHCP server script: Enumerate active leases.

#!/bin/sh

awk '/^lease / { addr=$2 }
/binding state active/ {print addr}' /var/db/dhcpd.leases | sort -u


dhcp_reset.sh

Gateway script: Connect to both DHCP servers, stop dhcpd, completely reset the DHCP lease database and then restart dhcpd, with a 10 second delay.


#!/bin/sh

for host in 192.168.10.10 192.168.11.10; do
  ssh $host '(killall dhcpd gluff; sleep 4; > /var/db/dhcpd.leases; > /var/db/dhcp.leases.log; > /var/log/dhcpd.log)'
done

sleep 10

for host in 192.168.10.10 192.168.11.10; do
  ssh $host /etc/rc.local
done

clients_restart.sh

Gateway script: Connect to all clients and restart them.


#!/bin/sh

clients=`/usr/local/bin/listleases.sh`

for a in $clients
do
  echo Restarting $a
  ssh -oStrictHostKeyChecking=no $a 'shutdown -r now' &
done

wait
echo Done.

team_stop.sh

Gateway script: Connect to all virtual systems and shut them down, then shut down the gateway. Note the use of the hosts.txt file, containing a list of server addresses.


#!/bin/sh

hosts="`/usr/local/bin/listleases.sh` `cat hosts.txt`"

for a in $hosts
do
  echo Shutting down $a
  ssh -oStrictHostKeyChecking=no $a 'shutdown -h now' &
  sleep 3
done

wait

shutdown -h now
