
BlueWolf's Howl


January 31, 2008

It's a Network Problem

Yeah...it's *always* a "network" problem (except when they can blame it on the firewall in some roundabout way). Just because something happens *on* the network doesn't mean it's *caused by* the network. I guess it's because people don't really understand networking, so the culprit _must_ be...ah...whatever thingie it is that they don't understand...

And so it is with the newest debacle in which I've gotten entangled. In the early morning hours, for whatever reason, 10 servers all lost connectivity with the network. There are 150 to 200 servers on the same VLAN. All are connected to the same switch. The only servers that lost connectivity were these 10 Sun servers. Other Sun servers - in the same VLAN, on the same subnet, on the same switch - did not show any interruption. Not one of the many Windows servers - in the same VLAN, on the same subnet, on the same switch - showed any interruption. There were no entries in the switch logs regarding any interruption or issues. Yet, this is still a "network" issue. [Read: I have to figure out what the damn problem is in order to get out of the hot seat.]

Here is why they still believe it's a network issue:
1) They are on a switch that has been historically unreliable.
2) More than one server was affected.
3) They imagine that perhaps there was a 'network' issue, but only these 'special' servers are sensitive enough to detect the problem.
4) A high-visibility app was affected. [This is the *real* reason they need a neck to choke.]

The commonality between the affected servers -- they were all running IPMP and the non-affected servers were not.

Now, often people think I'm odd because I have a thing for Operating Systems. I can configure, install, and maintain any Win-based server -- after years of experience and lots of study. See the Cert Battle Stats. But, I'm not satisfied there. I also have been playing/dabbling with Linux. And I have an old Sun Sparc that's running Solaris. Yeah, I could take another job as a Windows Admin. No, I'm not proficient enough to take a job as a Unix Admin. But, I really like to know these systems and how they work. As you can see - even though I work with routers and switches, it's still something I need to know for my job. You can't rely on the 'other guy' being as sharp (or as motivated) as yourself. And if they're blaming the network, why should the server guy waste cycles on investigating any further?

So I found a white paper on the Sun site which fully explains IPMP. Apparently, the way the interfaces know they're still good is that they send a probe to the all-hosts multicast address (224.0.0.1), and the first few devices to respond end up being the 'target hosts' for probe-based failover. Okay...so say host 10.4.3.15 responds quickest because it's on a powerful server? And what if all the IPMP connections are pinging that server to prove to themselves that they're alive? And what happens if that server just happens to go down? Well...if the target host fails, all the connections using that target host are going to falsely believe that they're down. Boom. It makes sense.
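To make that failure mode concrete, here's a tiny toy model in Python. It is only a sketch of my theory, not how the real IPMP daemon works: the host names, the `Host` class, and the rule "the link is down when none of my probe targets answer" are all my own simplifications, and the only detail taken from the story above is the example target 10.4.3.15.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Host:
    """Toy stand-in for a server (hypothetical model, not real IPMP code)."""
    name: str
    # Addresses this host probes to decide whether its own link is alive.
    # An empty list models a server that does no probe-based failover.
    probe_targets: List[str] = field(default_factory=list)

    def believes_link_is_up(self, reachable: Set[str]) -> bool:
        if not self.probe_targets:
            return True  # no probing, so no probe-based false alarm
        # Simplifying assumption: the link is declared failed only when
        # every probe target stops answering.
        return any(target in reachable for target in self.probe_targets)


def hosts_reporting_failure(hosts: List[Host], reachable: Set[str]) -> List[str]:
    """Names of the hosts that would declare their own link dead."""
    return [h.name for h in hosts if not h.believes_link_is_up(reachable)]


# Ten Sun boxes that all picked the same quick responder as their only target.
hosts = [Host(f"sun{i:02d}", ["10.4.3.15"]) for i in range(1, 11)]
# A Sun box with no probing, and a Windows box, on the same VLAN.
hosts += [Host("sun-plain"), Host("win01")]

everything_up = {"10.4.3.15"}
target_down: Set[str] = set()  # the switch is fine; only the target died

print(hosts_reporting_failure(hosts, everything_up))
print(hosts_reporting_failure(hosts, target_down))
```

With the target answering, nobody complains. Take the target away and exactly the ten probing hosts declare themselves down, while the switch and every non-probing server carry on as if nothing happened. That matches what we saw.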

So I run this by the Sun guys. Apparently, they're under the impression that the connections use 'the router address' for their probe-based failover. However, what they believe to be a 'router' address is actually the IP address attached to the VLAN. Yes, it's like a loopback address - it only exists in SOFTWARE. So if (and nobody's checked it yet) they were using that address, there is no way the servers could have failed to reach it. Perhaps the router was overwhelmed? Ah...no. It's not hardware - it can't be overwhelmed. Either they're using some other host address or the servers themselves were too busy. If the whole VLAN was inaccessible, other devices would have noticed. More servers would have lost connectivity and there would be an error message in the log.

The Sun guy is going to check it out. At least one person realizes that there's a possibility (no matter how slim a chance he thinks it may be) that it might actually be a server problem. One down. Oh so very many to go...

Posted by BlueWolf on January 31, 2008 06:08 AM