Home |  Blog |  DNS scripts |  Loghost HOWTO |  Syslog-ng FAQ 

Nov 21 03:23:47 PST 2008


Campin dot Net

Rebooting elysium

Connor received an email from a data warehouse developer. The email stated that the server named elysium was having filesystem trouble.

elysium is a Sun E6500 filled to max capacity with CPUs and memory: 26 CPUs and 26 gigs of RAM. All the data it parses is accessed via NFS from another SPARC-based host hooked up to two terabytes of storage over a fiber interface. The NFS server is in a different datacenter, but NFS performance is quite good. Eventually all the storage will be hooked up locally; the use of NFS is a temporary measure.

Connor knew that elysium had been patched with the Sun recommended patch cluster a couple days before and needed to be rebooted to run the new kernel. He reviewed the system logs on elysium and saw that two nights earlier it had trouble accessing the NFS server for a few minutes. He logged into the NFS server and saw that the server had been rebooted at that time. He reviewed the location on the site's software distribution server where patch cluster installation output is stored, and saw that the NFS server had also been patched. He knew the reboot would have been done by the admin at the completion of the patch install.

He checked the NFS mounts on elysium, saw they were accessible, and notified the developer that things currently appeared OK, but that processes running two nights earlier could have had trouble and might need to be restarted. The developer replied and let Connor know that there were still problems accessing files.

Connor knew that elysium needed to reboot after the patch install anyway, and he wanted to remount the fileserver as well, so he decided to reboot elysium and kill two birds with one stone. He had seen NFS problems between the two hosts clear up after remounting the server, so he had high hopes for the situation after the reboot. He ran the command 'w' and saw that elysium had been up for 170 days, ever since it was first installed and the (then current) patch cluster applied.

The hellish reboot

Connor rebooted the server and continued reading email while waiting for it to come back up. He waited about ten minutes and tried to connect to it via ssh without success. He sent it a ping and received no reply. He was a little worried but not overly so. elysium was in a datacenter about an hour away, but it was hooked up to a console server.

He looked up the console access information, but only had the IP of the console server, not the port elysium is hooked up to. He called the worker who maintains the company's equipment in their cage in the datacenter, hoping to get an answer. The worker was there, fortunately, and immediately gave the port number to access elysium.

Connor accessed elysium's serial port via the console server, using telnet. He was happy to see a normal login screen. He hated sending the root password over a telnet connection, but logged in as root anyway. He didn't really have any choice: elysium needed to get back into service right away. He made a mental note to change the root password on all hosts later. He had an expect script to do this on every host over ssh.

He ran 'ifconfig -a' and saw that both network interfaces were reported as up.

# ifconfig -a
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 10.16.16.60 netmask fffffe00 broadcast 10.16.17.255
        ether 8:0:20:e1:ca:6f
hme1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 123.123.123.123 netmask ffffff00 broadcast 123.123.123.255
        ether 8:0:20:e1:ca:6f

He queried the ethernet driver manually to get link, speed and duplex status:

# ndd -set /dev/hme instance 0
# ndd -get /dev/hme link_status
1
# ndd -get /dev/hme link_speed
1
# ndd -get /dev/hme link_mode
1
# ndd -set /dev/hme instance 1
# ndd -get /dev/hme link_status
1
# ndd -get /dev/hme link_speed
1
# ndd -get /dev/hme link_mode
1
He tried to ping elysium's default gateway IP, but received no response. He tried the gateway on the second interface, also with no success. This was weird, since the interfaces themselves still had link and appeared to be properly initialized.
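
For readers who don't have the hme driver's ndd values memorized, here is a small sketch that puts them into words. The function name decode_hme_link is made up for this example. The mapping assumes the usual hme convention: link_status 1 means link up, link_speed 1 means 100 Mbit, and link_mode 1 means full duplex.

```shell
# Translate the three hme ndd values into words.
# Usage: decode_hme_link <link_status> <link_speed> <link_mode>
# (decode_hme_link is a hypothetical helper, not a Solaris command.)
decode_hme_link() {
    case $1 in
        1) status="link up" ;;
        0) status="link down" ;;
        *) status="link unknown" ;;
    esac
    case $2 in
        1) speed="100 Mbit/s" ;;
        0) speed="10 Mbit/s" ;;
        *) speed="speed unknown" ;;
    esac
    case $3 in
        1) mode="full duplex" ;;
        0) mode="half duplex" ;;
        *) mode="duplex unknown" ;;
    esac
    echo "$status, $speed, $mode"
}

# The values elysium reported for both hme instances:
decode_hme_link 1 1 1    # prints: link up, 100 Mbit/s, full duplex
```

So as far as the driver was concerned, both interfaces were exactly where they were supposed to be: up, at 100 Mbit, full duplex.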

Connor started to fear that the patch installation had gone wrong somehow, breaking elysium's networking functionality.

Busted ARP

He wanted to verify that the settings reported by elysium's kernel were consistent with what the switch it was connected to was reporting. Connor doesn't like to admit that he uses AOL Instant Messenger, but he does since everyone at his work uses it, and it makes it really easy to contact network and development staff. He sent an instant message to the senior networking engineer, asking him to check out the switch. He was afraid of a possible duplex mismatch. elysium was forced to 100 Mbit speed and full duplex, as were all the switches. Mismatches sometimes occur despite this when one end mistakenly falls back to a slower speed or a different duplex setting. He pasted the ifconfig output into the AIM message window.

The network engineer didn't see anything wrong. Connor ran 'ping -s 123.123.123.1 &' to get a constant ping going against his default gateway IP, and started snoop in non-promiscuous mode on the public interface:

# snoop -P -d hme1
He usually uses non-promiscuous mode so that he only sees packets destined for the host he is on (and broadcast packets of course). If he ends up needing complex expressions he knows how to use them, but usually doesn't need to.

Connor saw lots of broadcast packets, but nothing to or from elysium. Weird. He started pinging elysium from another host on the same segment in the same datacenter, using another window in his screen session. He loves the screen program.

Once he started pinging from the second host he started seeing ARP requests going out for elysium's hardware address:

 123.123.123.10 -> (broadcast)  ARP C Who is 123.123.123.123, elysium ?

...but no replies from elysium. Now he was getting somewhere: he knew that for some reason things were going wrong at the ARP protocol level. snoop is a great program since it understands common network protocols and usually gives meaningful output. snoop doesn't relieve the admin of the need to understand how networks work, however, since it simply gathers information. It is up to the admin to interpret all the data collected.

He wasn't really sure what would cause ARP functionality to simply fail, other than the patch cluster installing a patch which either didn't install correctly or simply didn't work for some reason.

He wanted to test the interfaces, and see if they worked if static ARP entries were entered with the 'arp' command. Connor entered a static entry for a Linux server on the same segment as elysium:

# arp -s 123.123.123.10 00:60:4B:B1:C1:8C
...and on the Linux server the arp command had the same syntax:
# arp -s 123.123.123.123 8:0:20:e1:ca:6f

Connor tried pinging elysium from the Linux host, and it worked! He was very happy. He now thought that the public interface worked correctly except for its ability to use the ARP protocol. He told the network engineer, who said that they should add static ARP entries on elysium and its gateways on both interfaces.

Connor really liked this idea, as it would get the host talking to every host on the internet again except hosts on the local LAN, which it hardly had to do (only for DNS resolution, which could also be statically mapped or temporarily retrieved from DNS caches on another network segment).

Connor and the network guru added entries to their ARP tables, but with no success. Connor still couldn't ping either gateway. Neither of them knew what to do from there. Connor thought about it for a little while, and posted to the sun-managers mailing list to see if any Sun gurus had any ideas.

He started backing out patches that were installed with the patch cluster. He found an ARP patch, but it wouldn't allow itself to be removed. The same went for a driver patch for the network cards installed in the host. He successfully removed the kernel patch and another network-related patch, but neither patch removal (and subsequent reboot) changed the situation in any way.

Connor has done a lot of networking with Linux, including using it as a bridge device by employing proxy ARP. He has been able to firewall networks and hosts this way without having to explicitly set up routing through the Linux firewall hosts. He decided to connect each interface on elysium to a Linux host with two interfaces, and use proxy ARP on the Linux box to get elysium back onto the network.

Someone from the sun-managers mailing list suggested using proxy ARP on another UNIX host to get elysium back on the network, so Connor knew he was on the right track.

Visiting the datacenter

Connor drove to the datacenter after dinner. He had to travel from the San Francisco east bay to Santa Clara - which is not a fun drive during rush hour. He arrived at the datacenter at around 9:00PM. The cage worker was still there, and helped gather up some crossover cables and trace elysium's ethernet cables through the patch panels to their ports in the switches.

Connor ran a crossover cable from elysium's public interface to an unused NIC on a Linux host that had an interface on the public net. He set up packet forwarding and proxy ARP on the Linux box, along with the proper routes to talk to elysium and the rest of the public network. He manually removed elysium's old default gateway and entered a new default gateway of the internal NIC on the Linux host.
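
The original commands weren't saved, so what follows is only a rough sketch of what that kind of Linux-side setup can look like. The interface names eth0 and eth1, the run helper, and the DRY_RUN switch are all assumptions made for this example; the only value taken from the story is elysium's public IP. The sketch leaves out giving the crossover NIC an address for elysium to use as its gateway. By default it only prints the commands instead of running them.

```shell
# Sketch of proxy ARP plus forwarding on the Linux box.
# PUB_IF, XOVER_IF, DRY_RUN and run() are hypothetical names for this example.
PUB_IF=${PUB_IF:-eth0}        # NIC already on the public network (assumed name)
XOVER_IF=${XOVER_IF:-eth1}    # NIC on the crossover cable to elysium (assumed name)
ELYSIUM_IP=123.123.123.123    # elysium's public IP, from the ifconfig output
DRY_RUN=${DRY_RUN:-1}         # 1 = print the commands, anything else = run them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_proxy_arp() {
    # Forward packets between the two NICs.
    run sysctl -w net.ipv4.ip_forward=1
    # Answer ARP on each side on behalf of hosts reachable through the other side.
    run sysctl -w "net.ipv4.conf.${PUB_IF}.proxy_arp=1"
    run sysctl -w "net.ipv4.conf.${XOVER_IF}.proxy_arp=1"
    # Host route: elysium now lives on the crossover link, not the public segment.
    run ip route add "${ELYSIUM_IP}/32" dev "${XOVER_IF}"
}

setup_proxy_arp
```

Running it for real needs root and DRY_RUN set to something other than 1. The host route is the piece that tells the Linux box which side elysium is on, which in turn is what lets proxy ARP answer for it on the public segment.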

He could ping elysium from the Linux box and ping the internal interface on the Linux host from elysium. He still couldn't ping or otherwise connect to any hosts on the public network from elysium nor could he ping or otherwise connect to elysium. It all seemed to be in place, but still didn't work.

He ran tcpdump on the Linux box, listening only on the internal interface, and ran a continuous ping from elysium. He was amazed to see the internal IP of elysium as the source of the packets. His eyes widened as the truth dawned on him.

The solution - WTF?

Solaris uses the same MAC address for all interfaces with the same chipset in a single host, which is based on the machine's host-id. He was starting to see how, if the two interfaces on a Solaris host were switched and the host's MAC address was hardcoded on another host as he had done, that other host could still talk to it despite it having the wrong IP on that network. The remote host would send the packets to elysium's hardware address with a destination IP that was wrong for that interface, but elysium accepted them since they were destined for one of its local IPs.
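
The shared MAC address was sitting in plain sight in the ifconfig output from earlier. Here is a small sketch that flags it automatically; shared_macs is a made-up helper name, and it assumes Solaris-style 'ifconfig -a' output where each interface has an indented 'ether' line. On Sun hardware the OpenBoot setting 'local-mac-address?' is generally what decides whether interfaces share one address, but check your platform's documentation before relying on that.

```shell
# Read `ifconfig -a` style output on stdin and print every MAC address that
# shows up on more than one interface, followed by those interfaces.
# (shared_macs is a hypothetical helper for this example.)
shared_macs() {
    awk '
        # Interface header line, e.g. "hme0: flags=..."
        /^[a-z][a-z0-9]*:/ { ifname = $1; sub(/:$/, "", ifname) }
        # Indented "ether <mac>" line belonging to the current interface
        $1 == "ether"      { ifs[$2] = ifs[$2] " " ifname; n[$2]++ }
        END { for (mac in n) if (n[mac] > 1) print mac ifs[mac] }
    '
}

# Fed with the output elysium gave on the console:
shared_macs <<'EOF'
lo0: flags=1000849<UP,LOOPBACK,RUNNING,MULTICAST,IPv4> mtu 8232 index 1
        inet 127.0.0.1 netmask ff000000
hme0: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 2
        inet 10.16.16.60 netmask fffffe00 broadcast 10.16.17.255
        ether 8:0:20:e1:ca:6f
hme1: flags=1000843<UP,BROADCAST,RUNNING,MULTICAST,IPv4> mtu 1500 index 3
        inet 123.123.123.123 netmask ffffff00 broadcast 123.123.123.255
        ether 8:0:20:e1:ca:6f
EOF
# prints: 8:0:20:e1:ca:6f hme0 hme1
```

On a live host the same function can be fed with 'ifconfig -a | shared_macs'. A shared address is normal on this kind of hardware, so any output is a reminder rather than an error: a static ARP entry for that MAC can't tell you which cable the packets are really arriving on.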

Connor walked over to elysium and switched the ethernet cables in its two network interface cards. He removed the crossover cables from the patch panels and connected elysium's interfaces back into the switches. He rebooted elysium to rid it of all the funky routes and static ARP entries, and it was fully functional on the network.

Somehow the two interfaces had their IPs switched. The strange way Solaris shares MAC addresses between interfaces masked the problem by allowing the static ARP entries to partially fix it. Connor could have fixed the problem when it first appeared by switching the IPs for the two interfaces in /etc/hosts and rebooting.

The first question is who changed the interface IPs in the configuration during the 170 days of uptime without doing anything to ensure the host would still work after a reboot. This should never have been left in such a broken state.
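
A check along these lines, run before any planned reboot, might have caught the mismatch. It is only a sketch: boot_ip_for_if and check_if are made-up names, ETC_DIR is a hypothetical override for trying it on sample files, and it assumes each /etc/hostname.<interface> file starts with a plain hostname or IP that is resolved through /etc/hosts.

```shell
# Print the address an interface would get at boot, Solaris style.
# ETC_DIR is a hypothetical override so the sketch can be tried on sample files.
boot_ip_for_if() {
    etc_dir=${ETC_DIR:-/etc}
    # First word of hostname.<if> is either a hostname or a literal address.
    name=$(awk 'NR == 1 { print $1 }' "$etc_dir/hostname.$1") || return 1
    case $name in
        *[!0-9.]*) ;;                     # has non-digit characters: a hostname
        *) echo "$name"; return 0 ;;      # digits and dots only: already an IP
    esac
    # Look the hostname up in the hosts file (any name column may match).
    awk -v n="$name" '
        /^#/ { next }
        { for (i = 2; i <= NF; i++) if ($i == n) { print $1; exit } }
    ' "$etc_dir/hosts"
}

# Compare the live address (as shown by ifconfig) with the boot-time one.
# Usage: check_if <interface> <live-ip>
check_if() {
    boot=$(boot_ip_for_if "$1")
    if [ "$boot" = "$2" ]; then
        echo "$1: ok ($2)"
    else
        echo "$1: live $2 but boot config gives ${boot:-nothing}"
        return 1
    fi
}
```

Running something like 'check_if hme0 10.16.16.60' for each interface, with the live address copied from ifconfig, gives a quick answer to "will this host come back with the same addresses?" before the uptime counter is reset.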

The second question is whether or not the NFS problems will resurface. A whole day was lost after rebooting elysium due to improper configuration, with nothing even done about the NFS problems! :(

Of course those patches he backed out will have to be re-installed too :(
