Tight PVM integration and SGE 6.2 issues & workarounds
Klaus Schnepper finds that "... The new builtin interactive job support in qrsh breaks tight pvm integration in grid engine 6.2u1." The main cause is that the built-in interactive task support mechanism in SGE 6.2 does not seem to transmit the final close on stdin to the launched PVM slave daemons, which rely on that close to begin their own startup routine.
In this mailing list post, Klaus describes the issue in detail and provides workarounds (and a patch for pvmd.c).
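To make the failure mode concrete, here is a minimal Python sketch of a process that, like the PVM slaves described above, does nothing until its stdin is closed. This is not PVM or Grid Engine code; the child command, the function name run_child and the timeouts are all made up for illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for a daemon that waits for EOF on stdin before starting.
CHILD = "import sys; sys.stdin.read(); print('stdin closed, starting up')"


def run_child(close_stdin: bool, timeout: float = 5.0) -> str:
    """Start the child and report whether it got past its wait for EOF."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        if close_stdin:
            proc.stdin.close()  # the "final close" the child is waiting for
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
            return "child still waiting for EOF on stdin"
        return proc.stdout.read().strip()
    finally:
        # Release the pipe ends we still hold.
        if not proc.stdin.closed:
            proc.stdin.close()
        proc.stdout.close()


if __name__ == "__main__":
    print(run_child(close_stdin=True))
    print(run_child(close_stdin=False, timeout=1.0))
```

When the launcher never closes stdin, the child simply sits there, which matches the symptom Klaus reports for the slave daemons.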
Screencast showing 6.1 to 6.2 inplace upgrade
Lubomir Petrik published this screencast. Check it out.
SGE 6.2 beta candidate also out
Wow, check out the new feature list in the just-released SGE 6.2 beta announcement:
- GUI based installer helping new users to more easily install the software. It complements the existing CLI based installation routine
- New support for 32-bit and 64-bit editions of Microsoft Windows Vista (Enterprise and Ultimate Edition), Windows Server 2003R2 and Windows Server 2008.
- Client and server side Job Submission Verifier (JSV) allows an administrator to control, enforce and adjust jobs requests, including job rejection. JSV scripts can be written in any scripting language, e.g. Unix shells, Perl or TCL.
- Consumable resource attributes can now be requested per job. This makes resource requests for parallel jobs much easier to define, especially when using slot ranges.
- On Linux, the use of the 'jemalloc' malloc library improves performance and reduces memory requirements
- The use of the poll(2) system call instead of select(2) on Linux systems improves scalability of qmaster in extremely huge clusters
Note that this beta release is not for production use and is aimed at an experienced SGE audience. Please test it out and give the developers your feedback!
The announcement has all the details...
SGE 6.2u1 released today
Grid Engine 6.2 update 1 has been released. The official announcement page does not seem to be up yet, but you can find the older "available" notice here.
The list of fixed issues is significant and can be viewed here:
http://gridengine.sunsource.net/project/gridengine/62patches.txt
6.2 courtesy binaries now available
The 6.2 release was previously only available at http://www.sun.com/software/gridware/
Downloads for GE 6.2 are now available through the gridengine.sunsource.net site as well. The release announcement is here:
http://gridengine.sunsource.net/project/gridengine/news/GE62-available.html
Bug alert: Beware scheduler profiling in SGE 6.2
The command "qconf -tsm", when run as the root user, is a nice (but historically under-documented) tool for SGE admins. When it works, it does a one-time dump of scheduler information and writes it to $SGE_ROOT/$SGE_CELL/common/schedd_runlog.
Props to DanT for discovering an interesting bug in Grid Engine 6.2: if you invoke "qconf -tsm", the process does not stop after the first run. It repeats the dump every scheduling interval, growing the schedd_runlog file over and over again.
This is not a huge bug but it does have two negative consequences:
- Scheduler profiling is non-trivial; doing it repeatedly each scheduling interval may place additional load on your qmaster
- Most SGE admins would not be rotating or otherwise tracking the size of the schedd_runlog file as they would other SGE files like "accounting" that grow over time. Left unchecked on a busy cluster, this file may grow and cause space issues on the $SGE_ROOT filesystem
A really interesting facet of this bug is that restarting SGE and/or the scheduler has no effect and does not fix the recurring profile dump. This is likely why the issue was rated with a higher than normal severity level. Expect a patch or fix to be issued shortly.
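Until a fix arrives, one way to keep an eye on the file is a small size check run from cron or a monitoring system. The Python sketch below is only an illustration: the function name runlog_size_warning and the 50 MB default threshold are arbitrary choices of mine, and the caller passes in the location of schedd_runlog at their site.

```python
import os
from typing import Optional


def runlog_size_warning(path: str, max_bytes: int = 50 * 1024 * 1024) -> Optional[str]:
    """Return a warning string if the file at `path` is larger than `max_bytes`.

    Returns None when the file is within the limit, missing, or unreadable.
    The default limit is an arbitrary example value, not an SGE recommendation.
    """
    try:
        size = os.path.getsize(path)
    except OSError:
        return None  # nothing to report if the file is not there
    if size > max_bytes:
        return f"{path} is {size} bytes (limit {max_bytes})"
    return None
```

A cron wrapper could call this with the path to schedd_runlog and mail the result to the admin whenever it is not None.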
T-Shirt Contest
Want a T-shirt? Be quick and email Andy. Details below.
Do you want to win a truly nice open source T-Shirt?
There are T-shirts to win in three categories:
1. Among the first 50 of you who reply *directly* to me (andy.schwierskott@sun.com) and tell us what is the single most important or interesting feature in SGE 6.2 for you, we'll draw three T-shirts.
2. Three T-Shirts go to the people who first report that they have upgraded their production cluster to SGE 6.2. Test-beds, eval clusters and private use don't count.
3. Three T-Shirts go to people who will be using SGE for the first time, whether you are replacing another DRM system or starting to use a DRM system for the first time. Requirements: it must be SGE 6.2 and it must be production use, not just a test-bed, private use or eval cluster.
We'll respect your privacy and only make your name public if you agree to it! Sun Microsystems employees may not participate.
Please feel free to pass this announcement and 'lottery' along to mailing lists that care about SGE technology.
Regards, Andy
6.2 Officially Out
Grid Engine 6.2 is officially out. Follow the links in the blog post below to read DanT's excellent set of articles on "why upgrade to 6.2?".
Get it here:
http://www.sun.com/software/gridware/
This also marks the official transition to having all of the Sun SGE documentation and manuals in wiki form:
http://wikis.sun.com/display/GridEngine/Grid+Engine
SGE 6.2 Coming August 5th
Unofficial word is that the official release of SGE 6.2 is coming on Tuesday, August 5th.
To read up on why this is news, check out Dan's excellent essays:
qrsh changes planned for SGE 6.2
In this mailing list post from early December, Dan brings news of some interesting changes planned for how Grid Engine is going to handle interactive jobs starting with the SGE 6.2 release planned for mid-2008:
"In 6.2, coming middle of next calendar year, the mechanism for launching interactive jobs has been rewritten. With 6.1 and previous, an interactive job is started by submitting an rshd as a job and then forking an rsh to connect to it. (Or rlogin or telnet or ssh.) With 6.2, starting an interactive job means that the shepherd will fork a shell with a PTY and connect back to the qrsh client, no rsh/rshd involved. (With Grid Engine, the slave tasks for a parallel job are started via the interactive job mechanism.)
There are several important benefits to the new mechanism. First, no more rlogin port limits. Second, SGE certificate-based security will actually encrypt the communication streams of interactive jobs. Third, no more worries about whether your rlogin/rsh/ssh binaries are "tightly integrated" with Grid Engine. Fourth, you actually get a PTY... "
This is a very new direction for qrsh-based job and parallel task launching, and I'm guessing it will be very enthusiastically received by the community, as it greatly simplifies setup and administration. Dealing with rsh and SSH integration issues has always been a challenge.
A follow-up post asked how this new mechanism will affect users of Kerberos and AFS, which require token passing between machines as a way of securely handling distributed authentication and authorization. As someone who has personally discovered for the first time the joys of single-sign-on between Linux and Apple Mac OS X systems via Kerberos tokens and keytabs, I'll be interested in seeing how this plays out.
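For readers wondering what "fork a shell with a PTY" looks like in practice, here is a minimal Python sketch of the general technique on a Unix system. It is not the shepherd's actual implementation; the function name run_with_pty is hypothetical, and the sketch only shows a parent forking a child attached to a pseudo-terminal and reading its output back.

```python
import os
import pty
import sys


def run_with_pty(argv):
    """Run argv in a child attached to a new PTY and return what it printed."""
    pid, master_fd = pty.fork()
    if pid == 0:
        # Child: stdin/stdout/stderr are now the slave side of the PTY.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    chunks = []
    while True:
        try:
            data = os.read(master_fd, 1024)
        except OSError:
            break  # Linux reports EIO once the child side has been closed
        if not data:
            break  # other platforms signal end of output with an empty read
        chunks.append(data)
    os.close(master_fd)
    os.waitpid(pid, 0)
    return b"".join(chunks).decode(errors="replace")


if __name__ == "__main__":
    # The child reports whether its stdin is a terminal.
    check = "import sys; print(sys.stdin.isatty())"
    print(run_with_pty([sys.executable, "-c", check]).strip())
```

The child sees a real terminal on its standard streams, which is the "you actually get a PTY" benefit Dan mentions.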
