I recently implemented a Fault Tolerant File Server for a client and monitored the system for a few weeks before handing the “keys” over. The solution was chosen through a stringent tool selection process, and I was very happy with the choice.
About 5 weeks later I started to receive calls from the client stating that they could not access their files. “This is compromising our business – what is going on!!” Initial investigation showed that File System access was fine. The client stated that this was an intermittent problem: “It never happened before your solution was implemented, so that means that the problem is with the File System.”
Further investigation showed that:
- the hardware was up and running,
- the operational lights were showing as normal,
- the network cable was secure and the network port light was flashing as normal.
So perhaps the software on the File Server was faulty, or the configuration was wrong, or thresholds were being reached and not managed correctly? How could this have gone wrong? These were the things I was saying out loud to myself whilst troubleshooting. One of the features of the File Server solution was its advanced management system, and its error logs were complaining that the unit itself could not access the network. So I glanced at the switches and routers, and all looked fine.
I asked the users if anything had changed within the office since I was last there. They replied that there were no new software installations and that the hardware had not been changed. Everyone had been performing their work as before, until my “solution messed things up!” The only thing that had happened was a power outage of about 5 minutes the day before the issue started.
The client gave the same replies to the same questions, except that after the power outage he had decided to attach the File Server to an Uninterruptible Power Supply (UPS) unit. This unit sits under and behind one of the office switches. So I checked each and every cable one by one, removing the cable and then testing the connectivity within the office. One cable had a missing locking tab, and gently moving the cable was enough to dislodge it from the port. A quick test confirmed lost connectivity between the majority of computers (and the File Server). Since I replaced the cable with a secure one, there have been no connectivity issues (and there are still none after more than 2 weeks).
So what is the moral of this story then?
It looked as though the loss of connectivity to the Fault Tolerant File Server originated in the File Server itself. The File Server was the main focal point for the users, and its lack of availability meant that they understandably saw it as the cause of their ills. In this instance, however, the cause of the problem resided in an entirely unrelated area. When the UPS unit was being fitted, the faulty cable became slightly dislodged, and vibrations within the office would then dislodge and re-seat it. This prevented access to the File System (which was the focal point of the office), but it also prevented access between the majority of computers within that particular network topology, and this was not visible to the users.
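For anyone who wants to apply the same reasoning in a script, here is a minimal Python sketch of the idea: probe several machines, not just the one being blamed, and see whether the failure is isolated or shared. It is not what I ran on site. The function names, the result labels and the "more than half of the other hosts" rule are all my own assumptions for illustration.

```python
import socket
from typing import Dict


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def classify_fault(results: Dict[str, bool], suspect: str) -> str:
    """Classify an outage from per-host reachability results.

    `results` maps a host label to True (reachable) or False (unreachable).
    `suspect` is the host the users are blaming. The labels returned here
    are hypothetical and only meant to guide where to look next.
    """
    if suspect not in results:
        raise KeyError(f"no result recorded for suspect host {suspect!r}")

    if results[suspect]:
        return "suspect reachable"

    others = [ok for name, ok in results.items() if name != suspect]
    if not others:
        return "inconclusive"

    failed = sum(1 for ok in others if not ok)
    if failed == 0:
        # Only the blamed host is down: look at that host first.
        return "isolated to suspect"
    if failed * 2 > len(others):
        # Most other hosts are down too: suspect cabling, switch or router.
        return "shared network fault likely"
    return "partial fault"
```

In use, you would fill the dictionary by calling `is_reachable` for each machine with whatever addresses and ports apply on your network, then pass it to `classify_fault` along with the label of the File Server. Because the fault in this story was intermittent, a single pass could easily have missed it, so the probes would need repeating over time.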
This reminded me of software: in the majority of high-end integration testing, the problem found first is normally only a symptom and not the cause. Testers and users may need to investigate issues further, and should be allowed the time to do so.
The only question now is... was the faulty cable the root cause or a symptom of another problem?