Sunday, June 21, 2009

Internode Connectivity Diagnostic Failures

Welcome to the inaugural posting of HPCMonkey! I hope these bits of experience prove helpful in managing your Windows HPC System.

Host Name Management
Did you know that in the current version of Microsoft Windows HPC Server 2008, all host name resolution within the cluster is managed via the hosts file? Really. Take a look on one of your Head Nodes or Compute Nodes: open C:\Windows\System32\drivers\etc\hosts with Notepad or WordPad.
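If you would rather inspect the file from a script than from Notepad, here is a minimal Python sketch that reads a hosts file into a name-to-IP table. The function name parse_hosts and the choice to keep the first mapping seen for a name are my own assumptions for illustration, not anything HPC Server provides.

```python
from pathlib import Path

# Usual location on Windows; adjust if your system root is not C:\Windows.
HOSTS_PATH = Path(r"C:\Windows\System32\drivers\etc\hosts")


def parse_hosts(text):
    """Return {lowercased_hostname: ip} for every entry in hosts-file text."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # an address with no name attached; skip it
        ip, names = fields[0], fields[1:]
        for name in names:
            # Assumption for this sketch: keep the first mapping seen.
            entries.setdefault(name.lower(), ip)
    return entries


if __name__ == "__main__":
    if HOSTS_PATH.exists():
        table = parse_hosts(HOSTS_PATH.read_text())
        for name, ip in sorted(table.items()):
            print(f"{name:30} {ip}")
```

Run it on a Head Node and on a Compute Node and compare the output; any node missing from one list is a node the other machine cannot look up by name through its hosts file.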

The Implications
When the Head Node (HN) needs to communicate with a Compute Node (CN), it looks up the IP address in its hosts file first, rather than asking your internal DNS server. Generally this works fine, as the HN keeps a fairly current copy of the hosts file. When a CN needs to communicate with another CN, it too refers to its own hosts file rather than your internal DNS server. If that file is outdated, communication failures will occur, even if your DNS is up to date.
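To spot an outdated copy, you can compare a node's name table against a reference you trust, such as the Head Node's copy of the hosts file. The sketch below assumes both have already been loaded into plain name-to-IP dictionaries; diff_hosts and the sample node names are hypothetical.

```python
def diff_hosts(reference, candidate):
    """Compare two {hostname: ip} mappings.

    Returns (missing, mismatched):
      missing    - sorted names present in reference but absent in candidate
      mismatched - {name: (reference_ip, candidate_ip)} where the IPs differ
    """
    missing = sorted(name for name in reference if name not in candidate)
    mismatched = {
        name: (ip, candidate[name])
        for name, ip in reference.items()
        if name in candidate and candidate[name] != ip
    }
    return missing, mismatched


if __name__ == "__main__":
    # Made-up example data: the head node knows three nodes, the compute
    # node's copy is missing one and has an old address for another.
    head_node = {"cn001": "10.0.0.11", "cn002": "10.0.0.12", "cn003": "10.0.0.13"}
    compute_node = {"cn001": "10.0.0.11", "cn002": "10.0.0.99"}
    missing, mismatched = diff_hosts(head_node, compute_node)
    print("missing:", missing)
    print("mismatched:", mismatched)
```

A non-empty result in either list means that node is likely to fail name lookups for some of its peers, whatever your DNS server says.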

The Hard Learned Lesson
This hosts file is only updated about 10 minutes after all provisioning activities are completed. So, if you are in the middle of provisioning, say, 100 nodes and 50 are complete, don't bother running diagnostics such as Internode Connectivity or MPI Ping-Pong yet. The nodes that are already deployed cannot resolve one another's names until the updated file arrives, so the tests will fail.

How to avoid this quirk in the future
Wait. Wait until all provisioning activities are complete, with every node in either the Offline (successful deployment) state or the Unknown (failed deployment) state. Then wait another 10 minutes for the updated hosts file to be propagated to all CNs. Then you can start your diagnostic tests.
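If you would rather not watch the clock, the waiting can be scripted. Below is a rough Python sketch that polls until every node name you expect shows up in a name list you supply, or until a timeout expires. The function wait_for_nodes, its parameters and the default timings are all assumptions of mine. It only checks whatever source you hand it (for example, the local hosts file), so it does not prove the file has reached every CN.

```python
import time


def wait_for_nodes(expected, read_names, timeout_s=1200, interval_s=30,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until every expected node name is reported by read_names().

    expected   - iterable of node names you provisioned
    read_names - callable returning the host names currently known
    Returns a sorted list of names still missing (empty list means ready).
    """
    wanted = {name.lower() for name in expected}
    deadline = clock() + timeout_s
    while True:
        present = {name.lower() for name in read_names()}
        missing = sorted(wanted - present)
        # Stop when nothing is missing or the time budget is used up.
        if not missing or clock() >= deadline:
            return missing
        sleep(interval_s)


if __name__ == "__main__":
    # Stand-in data source for illustration; replace with a function that
    # reads the real hosts file on the machine you are checking.
    known = {"cn001", "cn002"}
    still_missing = wait_for_nodes(["CN001", "cn002"], lambda: known,
                                   timeout_s=60, interval_s=5)
    print("ready" if not still_missing else f"missing: {still_missing}")
```

Only kick off Internode Connectivity or MPI Ping-Pong once the returned list is empty on the nodes you care about.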

What will the future hold for HPC?
One would hope that a more robust and responsive hostname management system will be put into place, such as enabling DNS Services on the Head Node and allowing it to manage all hostname resolution within the cluster.
