Lemieux

Lemieux

Lemieux was decommissioned on December 22, 2006.

  • The queues were turned off at midnight EST, Tuesday, December 19.
  • Logins were disabled at 5 pm EST on Friday, December 22.

See the migration document for guidance on migrating from lemieux to bigben, and read the bigben document for additional information on computing on bigben.

Contact PSC User Services with any questions.

Additional Information


Storing Files

-->

File repositories

File repositories are file storage spaces which are not directly connected to a frontend or compute node. You cannot, for example, open a file that resides in a file repository from inside a program. You must use explicit file copy commands to move files to and from the repository. You currently have one file repository available to you on lemieux: golem.psc.edu.

Golem is a combination tape-and-disk archival system.

Transfer files to and from golem interactively before and after your batch job runs, so your job does not tie up compute nodes while performing file transfers.

If you need to store a file to golem that is 2 Tbytes or larger please first contact User Services so that special arrangements can be made to store your file.

Files can be stored and retrieved from golem using one of these methods: tcscp, far, kftp, gridftp, scp or sftp.

tcscp

The tcscp program can be used to transfer files between golem and your lemieux home directory or $SCRATCH. To transfer files between golem and $LOCAL on your compute nodes you must use either $SCRATCH or $HOME as intermediate locations.

far

The far program can be used to transfer files between golem and your lemieux home directory or $SCRATCH. To transfer files between golem and $LOCAL on your compute nodes you must use either $SCRATCH or $HOME as intermediate locations.

We recommend using tcscp rather than far for transferring files between $SCRATCH or $HOME and golem. If you prefer the far interface you can indicate that you want far to actually use tcscp for file transfers by placing the line

set transport = "tcscp"

in the file .far.conf in your home directory.

kftp, gridftp, scp, sftp

You can use kftp, gridftp, scp or sftp to transfer files between golem and your remote machine. We recommend against using sftp. See the golem Web page for more information.

Transferring Files

You can use kftp or scp to copy files between lemieux and a remote machine. The far program transfers files from lemieux to golem, the archival system.

Both kftp and scp are secure file transfer methods. But while kftp encrypts the authentication information only, scp encrypts both the authentication information and the data being transferred. For this reason, kftp is generally faster than scp.

Kftp

Lemieux is running both a Kerberos 5 (K5) client and server. Thus, if your local machine has K5 client/server software installed, you can use kftp to transfer files to and from lemieux whether you are logged into lemieux or your local machine. The examples below assume that you are logged into your local machine.

Before you can use kftp to transfer files, you must authenticate yourself to lemieux. To do this use the kinit command.

kinit username@PSC.EDU

For 'username' substitute your PSC userid. PSC.EDU is PSC's Kerberos realm name.

After you enter this command you are prompted for your PSC Kerberos password, which is the password you use to login to lemieux.

Once you are authenticated you can use the kftp command to actually perform your file transfers.

kftp lemieux.psc.edu

The kftp command functions like the ftp command.

To transfer files into or from $SCRATCH you will need to specify the full pathname without using any variables.

You should verify that the Kerberos commands operate on your local system as described here. Some installations of Kerberized ftp differ in their implementation.

Man pages for kinit and kftp are available on lemieux.

A Unix kftp client is available at http://www.pdc.kth.se/heimdal. A Windows kftp client is available at http://web.mit.edu/network/kerberos-form.html.

Scp, discussed below, and sftp can also be used for lemieux file transfers, but kftp will generally be much faster. We recommend against using sftp.

Scp

The secure copy program, scp, can be used to transfer files between your remote machine and your lemieux home directory or your $SCRATCH directory. You cannot use scp to copy files between $LOCAL and your remote machine. To get remote files to $LOCAL you must go through your lemieux home directory or $SCRATCH.

The format for the scp command is

scp source-filename target-filename
where the filename on the remote system, whether it is the target or the source, must be specified as
username@system:filename

For example, to copy a file to your $SCRATCH directory on lemieux when you are logged in to your remote system use a command such as

scp file username@lemieux.psc.edu:/usr/scratch/n/username/file
If you are logged in to lemieux and you want to copy over a file from your home system to $SCRATCH, use a command such as
scp username@remote-system:file  /usr/scratch/n/username/file

The first time you use scp to or from lemieux, you will receive a message similar to

Host key not found from list of known hosts.  Are you sure 
you want to continue connecting?

Answer 'yes' to make the connection. You should not receive this message on subsequent connections.

You will be prompted next for your password on the remote system.

You may be able to improve the scp transfer rate by choosing the blowfish encryption method rather than using the default. To do this, type:

scp -c blowfish file  username@remote-system:file

For more information on the scp command, see the scp man page.

Scp is part of the ssh distribution. PSC provides a list of sites that distribute ssh.

We strongly recommend that you use kftp rather than scp if kftp is available.

Far

You can use the far program to move files between your lemieux home directory or your $SCRATCH directory and golem, PSC's file archiver.

Tar

Whether you are transferring files between lemieux and golem or between lemieux and your remote system, if you have many files--1000 or more--it is much more efficient to tar them up into one tar file and then transfer this single tar file, especially if they are small files, 64 Kbytes or smaller

Tru64 tar--located at /bin/tar--can only create tar files up to 8 Gbytes. Gnu tar--located at /usr/psc/gnu/bin/tar--can create files larger than 8 Gbytes. However, a file created by Gnu tar that is larger than 8 Gbytes cannot be read by Tru64 tar.

You should not make a tar file larger than 50 Gbytes without first contacting User Services. You should move your tar files to golem or to your remote system as soon as you can after you create them.

The batch queue

The debug queue

System drain

A full system drain is scheduled for Wednesday at 8:00 a.m. unless it is superseded by a drain for a test time. We try to schedule our test times so that they correspond to this Wednesday 8:00 a.m. drain. Check the lemieux status page to see if a test time is scheduled for lemieux.

Comments on system queues

Send email to remarks@psc.edu if you have complaints about your job turnaround or if you need special scheduling considerations to meet a project deadline.

Batch access

Qsub

Sample job script

prun

More information is available on the prun man page.

Submitting your script for execution

Other qsub options

-l rmsrails=n (-l is lowercase "L")
-l rmsproject=project

Executing a script with prun

Multiple simultaneous parallel executions in a single job

Chaining Jobs

File I/O

Tcscp and golem

Tcscp can be also be used to copy files between $HOME or $SCRATCH and golem. In fact, we we recommend that you use tcscp rather than far for these types of file transfer. For example, the command

tcscp golem:input.dat $SCRATCH/input.dat

will copy input.dat from your golem home directory to $SCRATCH. You should copy your files between golem and $HOME or $SCRATCH before or after you run your batch jobs.

If you prefer the far interface to the tcscp interface you can indicate that you want far to actually use tcscp for its file transfers by placing the line

set transport = "tcscp"

in the file .far.conf in your home directory.

Interactive Access With the Qsub -I Option

The -I (capital "i") option to qsub can be used to provide a form of interactive access on lemieux.

Interactive jobs in the debug queue

Interactive jobs in the batch queue

-->

pmm

A simple alternative to the Lemieux monitor is the pmm command. Issue it from a lemieux login node, and it will display the status of all the nodes in the system. Type pmm -help for a fuller description.

Checkpointing

PSC has created a custom user-level checkpoint/restart system to facilitate automated job recovery in the event of system failures. If there is a system failure, any jobs abnormally terminated by it can be automatically restarted if they have been prepared to do so. Proper job preparation includes adding explicit code to write and read checkpoint files. If you do not explicitly write out checkpoint files while your code is running then they will not be available if your job is restarted, and your code will probably have to repeat all of its prior computations. PSC provides a library of routines and an API to make it easier for you to do this. The checkpoint system keeps track of the nodes on which your processes are running and the nodes to which they are assigned when your job restarts, so that each process will have access to the correct checkpoint file.

The more processors you use and the longer your job runs, the higher the probability your run will be interrupted by a system failure. Structural features of your code, such as how long it takes your code to get from one state to the next, will also determine how often you should write out checkpoint files. You will need to balance the amount of time it takes to write your checkpoint files with the amount of computation your code will have to perform upon restart. As a rough heuristic, if your job uses hundreds of processors then you should checkpoint every three hours.

Checkpoint Library Routines

The checkpoint library contains the following routines, which can be accessed via C, C++ or Fortran. Fortran users do not include the trailing underscore. There is also a set of C routines without the trailing underscore. These routines only require pointer arguments for the string variable arguments.

Sample C and Fortran programs demonstrating the checkpoint routines and a sample makefile are available on TCS machines in the /usr/psc/examples/tcsio directory.

int tcs_init_(void *exp);

This routine initializes the TCS IO library. It must be called before any other functions in the library. If this routine is not called first, all other TCS IO library functions will fail with an appropriate message. The purpose of the tcs_init_ function is to allocate the necessary resources (e.g. in the RMS DB, the file systems, etc.) to track the state of the running job. The exp ("expansion") parameter is reserved for future use, so you should specify it as 0. The routine returns a non-zero value if unsuccessful.

int tcs_jobrestarted_();

This routine indicates whether your job is a restarted job or not. It returns zero if it is not a restarted job. It returns the jobid of the prior job if the job has been automatically restarted or if the prior job called tcs_preservestate_. It returns a negative integer if it fails to determine the restart status of the job. Your job should call this routine to determine whether it should read data from checkpoint files before resuming normal processing or just begin with normal processing.

int tcs_open_write_(char *prefix);

This routine opens a checkpoint file in a prefix set for writing. The value of the variable prefix is an alphanumeric string of no more than 20 characters. Each prefix corresponds to a prefix set of checkpoint files. No two open checkpoint files can have the same prefix, but if you close a checkpoint file and then open another with the same prefix, a new checkpoint file is created in the set of files for that prefix. Your program can have multiple prefix sets, with different prefixes, open at the same time. You would use multiple prefix sets if you want to store different types of data in separate checkpoint files. A complete checkpoint must have exactly one file for each prefix set.

When you use a prefix to open a checkpoint file for reading, your job will be given the latest complete file in the set of files for that prefix. Thus, a common checkpoint strategy will be to open, for each group of data you want to save, a checkpoint file with an appropriate prefix at the bottom of your program's "outer loop", write the checkpoint data, and then close each file. Each pass through the loop will create one new version of a checkpoint file for each of your checkpoint prefix sets. You may have to use a barrier to prevent your prefix sets from becoming out of sync by more than 1 time step across your processors. If there is a system failure you will be able to access the latest checkpoint file in each prefix set. The routine returns -1 if unsuccessful. Otherwise it returns a logical unit number, which must be passed to the write and close routines.

int tcs_open_read_(char *prefix);

This routine opens a checkpoint file for reading. This is the routine you would use if the routine tcs_jobrestarted_ indicated that your job is a restarted job. The value of the variable prefix is an alphanumeric string of not more than 20 characters that is used to determine which set of checkpoint files will be considered for opening. Your job will be given the latest complete checkpoint file from the set of files with the prefix you specified. Some of the checkpoint files could have been automatically recreated if they were lost due to system failure. The routine returns -1 if unsuccessful. Otherwise it returns a logical unit number, which must be passed to the read and close routines.

long tcs_write_(int *lun, void *A, const long * len);

This routine writes to a checkpoint file. The value of lun is a logical unit number from a previous open for writing. Variable A is a pointer to the buffer where the data to be written out resides. Variable len specifies the number of bytes to write. The routine returns the number of bytes written.

long tcs_read_(int *lun, void *A, long int *len);

This routine reads from a checkpoint file. The value of lun is a logical unit number from a previous open for reading. Variable A is a pointer to the data buffer where the data will be placed. Variable len is the number of bytes to be read from the checkpoint file. The routine returns the number of bytes read.

int tcs_close_(int *lun, int *keep);

This routine closes a checkpoint file. The value for lun is a logical unit number from a previous open call. The value for keep is 0 if you just want your checkpoint file to be used for recovery purposes, in which case the checkpoint file is saved in a system file area. Specify 1 for keep if you also want a copy of your file to be stored in the directory which you specify as the value for the environment variable TMDPIR. The routine returns a non-zero value if unsuccessful.

int tcs_finalize_(void);

This routine releases the resources allocated by the tcs_init() function. You should call it at the end of your job. It returns a non-zero value if unsuccessful.

int tcs_drainoperation_(void);

If your job has short checkpointable intervals (e.g. requires only 15 minutes to complete the calculations in each "time step") you may want to write checkpoint files after every "N" intervals. Typically the TCS machine administrators will know well in advance when the machine is going to be "drained" of running jobs. In such situations the TCS admin can post a notice to running jobs, via the RMS database, notifying them of an impending drain operation. This routine queries the state of that flag. It returns a positive integer if there is an impending drain operation, zero if no such notice has been posted, and a negative integer if it was unable to determine that status. If it returns a positive integer then the program should opt to write a checkpoint as soon as possible to avoid loss of work.

int tcs_preservestate_()

This routine causes the system to save your last set of checkpoint files from a run so that it can be used in your next run as if that run were a restarted run. In the next run tcs_restarted_ will return a non-zero value. You can then access your checkpoint files across runs.

To use this routine you must call it before you call tcs_finalize_. In your subsequent run you must set the environment variable TCS_RECOVERJOBID to the PBS jobid of your first run before you run your program. The use of tcs_preservestate_ is independent from the use of the keep feature to tcs_close_.

Checkpoint/Recovery Usage

In order to successfully utilize the PSC checkpoint/recovery system, you must follow these steps.

  1. Include the header file tcsio.h (tcsiof.h, if you are using Fortran) in your source code. This file is also a useful source of documentation and is available on TCS machines in the /usr/psc/include directory.


  2. Modify your source code to write checkpoint files at a "restartable point" in your program. A "restartable point" might be the end of an "outer" program loop (e.g. time step). If you already have checkpointing code in your program, replace your existing open/write/close statements with tcs_open/tcs_write/tcs_close statements. (See sample codes /usr/psc/examples/tcsio/test_*.* on TCS machines.)


  3. Use the sample makefile (LDFLAGS, in particular) to compile and link your program. (See /usr/psc/examples/tcsio/Makefile on TCS machines.) You must use gmake with this makefile.

Checkpoint Plans

PSC's implementation of the checkpoint system allows the user to choose from among several possible checkpoint strategies, or "plans". Each plan is identified by a string (e.g. "C1") of historic origin. To select a specific plan, set the TCS_PLAN environment variable in your PBS job script to one of the values listed below. Some plans have additional parameters that the user can define. These are listed below in the descriptions of each plan. If the environment variables specified below are undefined then the checkpoint system will use default values.

To date, in all plans the primary copies of checkpoint files are written to local disk on the compute nodes. Different plans achieve robustness by backing up these primary checkpoint files in different ways. There may, however, be future implementations in which primary checkpoint files are written on remote servers.

Currently, only 1 plan is available.

Plan "B2": Duplication to File Server

Plan B2 is a direct, "brute force" approach. Under plan B2 every primary checkpoint file is copied to a remote file server immediately after closure. Within the tcs_close_ method, a request is passed to the TCS IO daemon on the (local) compute node requesting that the checkpoint file be migrated to a file server. The user's job then returns to computation while the local daemon performs the file migration on the user's behalf. Primary checkpoint files which are lost due to machine failure are thus recovered from their duplicate on the file server.

Pros:

  • This plan achieves the maximum effective bandwidth to disk, utilizing the system's buffered I/O.
  • This plan imposes no limits on the size of buffers passed to the tcs_write_ function.
  • Since file migration is done by a separate process (the TCS IO daemon) computation is minimally impacted by IO operations.

Cons:

  • This plan doubles the disk space requirements of the checkpoint system.
  • There is an implicit latency (the file migration time) during which a successfully checkpointed program could lose a node and be forced to recover from a previous checkpoint.

Plan "C1": XOR-based Parity File

Plan C1 is a clever implementation making use of "spare" computing cycles during an I/O operation. Under Plan C1 parity files are calculated via a bit-wise XOR of checkpoint data. On every processor, data are written to the primary checkpoint file via asynchronous I/O (aio) while also participating in a checkpoint calculation. The set of processors in the job are divided up into sets spanning N processes, each on different nodes, where N is a configurable parameter. The value of N can be set via the environment variable TCS_NODES_XOR; its default value is 8. Thus, a job can be recovered if no more than 1 out of every N nodes (per XOR set) has failed. During the aio write operation the first process in each XOR set performs an MPI_Reduce operation, calculating the XOR value and writing the resulting data to an XOR file, which is stored on a file server. The data from any missing node(s) can be regenerated by the reverse XOR calculation, including the N-1 primary checkpoint files and the XOR file for that set.

Pros:

  • By calculating only one XOR file per TCS_NODES_XOR files, the disk utilization overhead of this scheme is minimized.

Cons:

  • This plan requires that there is sufficient free memory to allocate an additional XOR buffer as large as the buffers passed to the tcs_write_ function. Thus, if a user is utilizing 3GB of memory on a node a write operation of chunks larger than 1GB would result in a failure.
  • The asynchronous I/O library (libaio.so) cannot make use of buffered I/O. Thus, it achieves write bandwidths more characteristic of direct I/O.

Scientific and Mathematical Packages

A list of scientific and mathematical packages installed on lemieux is available online.

Programming Tools

TotalView

TotalView is a debugger which can be used to debug serial and parallel codes. Online information is available for TotalView.

Pixie

Pixie is a tool which can be used to profile single processor performance of a code. See the man page for pixie for more information.

Hiprof

Hiprof is a tool which can be used to profile single processor performance of a code. See the man page for hiprof for more information.

Atom

Atom is a suite of tools which can be used to profile single processor performance of a code, including generating Mflop rates. It can also be used to measure scaling performance for parallel codes. See the man page for atom for more information.

Debugging Tips

Debugging strategy

Your first few runs should be on a small version of your problem. Your first run should not be for your largest problem size. It is easier to find code problems if you are using fewer nodes. This strategy should be followed even if you are porting a working code from another system.

Also, you should use the debug queue for your debugging runs. Never run a debugging run on the lemieux front ends. You should always run a lemieux program using qsub and prun.

Debuggers

The TotalView debugger is available on lemieux to debug parallel and serial codes. TotalView can run on at most 64 nodes because of license restrictions.

Compiler options

Several compiler options can be useful to you when you are debugging your code. If you use the -g option when compiling, the error messages the system provides when your code fails will probably be more informative. For example, you will probably be given the source code line number of the source code statement that caused the failure. Once you have a production version of your code do not use the -g option. If you do your code will run slower.

The -check_bounds compiler option will cause your code to report if it exceeds an array bounds while it is running. The -warn argument_checking option will cause your running program to report any mismatched procedure arguments.

The -g and -check_bounds options are available in Fortran and C, but the -warn argument_checking option is only available in Fortran.

Core files

Core files may be useful to you when debugging your code. Both TotalView and ladebug can be used to work with core files. In addition, if you uncover a system bug we may need the core file from your execution.

The default behavior of the system is to delete any core files when your job ends. To make sure your core file is saved after your job completes you must set the environment variable RMS_KEEP_CORE to 1 in your batch script before you run your executable. If you then provide us with the PBS jobid of your failed job we will be able to retrieve the core file.

Variable initialization

Alpha processors do not automatically initialize variables, unlike many other processors. This is a common cause of program failures when you port a code to lemieux from other platforms where they worked because the processor was automatically initializing your variables.

By default the HP Fortran compiler will warn you at compile time if you use an uninitialized variable. However, the Fortran compiler does not catch every case where this happens and for those cases it does catch it just issues a warning. It does not initialize the variable.

The -trapuv option to the C compiler causes all uninitialized stack variables to get a value that will cause your program to fail if they are used without another value being placed in them. Stack variables include local automatic variables and function arguments. Thus, the -trapuv option can be used to detect certain types of uninitialzed variables in C.

Exception handling

The HP compilers may handle exceptions in a manner different from the other platforms on which you may have worked or from which you are porting a code. For example, if you use the -check underflow option to the f90 compiler you will receive a runtime warning if there is an integer underflow. Otherwise your code will run through an integer underflow. Similarly, if you use the -check overflow option to f90 a floating point overflow will cause your code to fail. Otherwise your code will run through a floating point overflow. On the other hand, codes that generate "inexact results," such as NaN or Infinity, on other platforms will cause a floating point exception on lemieux. You can mimic the former type of behavior on lemieux if you use the -fpe3 option to f90.

The HP C compiler, by default, will run through all exceptions, except integer and floating point underflow and cases where results are affected by roundoff. If you want your C code to run through all exceptions you can use the -speculate all option to cc.

Little Endian versus Big Endian

The data bytes in a binary floating point number or a binary integer can be stored in a different order on different machines. Lemieux is a Little Endian machine, which means that the low-order byte of a number is stored in memory with the lowest address for that number while the high-order byte is stored in the highest address for that number. The data bytes are stored in the reverse order on a Big Endian machine.

If your machine has Tcl installed you can tell whether the machine is Little Endian or Big Endian by issuing the command

echo 'puts $tcl_platform(byteOrder)' | tclsh

Most Little Endian machines, including lemieux, have facilities for reading and writing Big Endian files. The reverse is often not the case for Big Endian machines.

If you are using Fortran on lemieux there are three methods you can use to read and write Big Endian files. First, if you do not want to change your source code or do not have access to the source code, you can set the environment variable FORT_CONVERTXX to 'BIG_ENDIAN' for each Fortran unit number that you want to use to read or write in Big Endian format. For 'XX' you substitute the unit number. For example, the command

setenv FORT_CONVERT15 BIG_ENDIAN

will cause unit 15 to process files in Big Endian format.

Second, if you want to have all your Fortran unit numbers to process data in Big Endian format you can use the -convert big_endian option to f90. Other values for -convert, such as cray or ibm, can be used to specify more precisly how you want your binary data to be handled. See the f90 man page for more information about the -convert option.

Finally, you can add the parameter

convert='big-endian'

to the open statements for those files you want to read or write as Big Endian.

There are no equivalent facilities available for C.

Optimization

Mflops

To improve the performance of your lemieux code you should begin with single-processor performance. To measure your single-processor performance you should compute the the Mflops rate for your code.

To compute your code's Mflops rate follow these steps.

  1. Compile your code.
    f90 -fast -o yourprog yourprog.f90 -lmpi -lelan
    
  2. Generate a list of the routines in your code.
    nm -g yourprog | perl -e'while (<STDIN>) {@fields = split(" ", $_); \
    if ($fields[4] eq "T") {print "$fields[0]\n";}}' > routines
    
  3. Edit file routines to put the name of your main program first. With a C program or a Fortran program without a program statement the main program will be called 'main'. If a Fortran program has a program statement the main program name will be the name in the program statement with an '_' appended to it. You can also delete from file routines any routines for which you do not want to determine an Mflops rate.
  4. Load the atom tools module.
    module load atom
    
  5. Instrument your code so that it will generate timings and operation counts.
    cat routines | atom -tool timer5 yourprog
    cat routines | atom -tool flop2 yourprog
    
    The first command will generate an executable named yourprog.timer5. This will generate timing data. The second command will generate an executable named yourprog.flop2. This will generate operation counts. You cannot generate instrumented codes on the scratch file systems.
  6. Run both instrumented executables on a single processor.
    prun -N 1 -n 1 ./yourprog.timer5
    prun -N 1 -n 1 ./yourprog.flop2
    
    The first executable will generate a file named tprof.out.XXX.YYY, where 'XXX' is derived from your job's compute node name and 'YYY' is derived from your job's process id. This file contains timing data for each of your routines. The second executable will generate a file named fprof.out.WWW.ZZZ, where 'WWW' is derived from your job's compute node and 'ZZZ' is derived from your job's process id. This file contains operation counts for each of your instrumented routines. You can use the information in these two files to determine an Mflops rate for your entire code and for each of your routines by dividing the operation counts by execution time.

An Mflops rate above 400 is acceptable on lemieux. If your code performs below that number you should invest some effort in improving its single-processor performance.

Cache performance

Poor cache performance is the most common cause of poor single-processor performance on lemieux.

Each lemieux processor has an L1 and an L2 cache. The L2 cache is a unified, 8 Mbyte second level cache. For information on the L1 cache see the Compaq 21264/EV68CB Hardware Reference Manual.

To calculate the cache miss rate for your code follow the steps below.

  1. Compile your code
    f90 -fast -o yourprog yourprog.f90 -lmpi -lelan
    
  2. Load the atom tools module.
    module load atom
    
  3. Use atom to instrument your code so that it will generate cache miss data.
    atom -tool cache yourprog
    
    This generates an executable named yourprog.cache. You cannot generate instrumented codes on the scratch file systems.
  4. Execute yourprog.cache on a single processor.
    prun -N 1 -n 1 ./yourprog.cache
    
    This will generate a file named cache.out, which contains the cache miss rate for your program.

A cache miss rate of greater than 10% is poor. If your code has a cache miss rate of greater than 10% you should use the standard methods for improving cache performance. They are described in http://oscinfo.osc.edu/training/perftunmic (see the link under "Handouts" at the bottom of the page) and http://www.psc.edu/training/WVU/Performance_Optimization.ppt.

Using third-party software

Replacing your own code by third-party software which has been optimized to run on lemieux can often improve the performance of your code. For example, for FFTs we recommend using the FFTW library. For linear algebra routines we recommend the dxml library. There is a man page for dxml on lemieux.

Compilers and compiler options

The HP compilers, in general, generate more efficient code than the Gnu compilers. The HP compilers provide various compiler options to you to try to improve the single-processor performance of your code. The most likely candidates for both f90 and cc are

  • -fast
  • -arch host
  • -tune host
  • -unroll 8
  • various levels of the -O option

The f90 options -transform_loops, -pipeline and -assume_buffered_io may also be useful. You should read the f90 and cc man pages to see what other options may improve the performance of your code.

We have found the -fast option to be most useful. The -fast option actually sets some of the other options. The f90 and cc man pages will tell you which ones.

However, be aware that some of the options, including the -fast option and the various levels of the -O option, may actually make the performance of your code worse. In any event, compiler options are not likely to have a significant impact on your code's performance.

Single statement performance measurement

When troubleshooting your code's single-processor performance it can occasionally be useful to determine how much time single statements are consuming. You can use the pixie tool to generate this information.

To generate single statement performance measurements follow the steps below.

  1. Compile your program with the -g option.
    f90 -fast -g -o yourprog yourprog.f90 -lmpi -lelan
    
  2. Execute your program on a single processor using the pixie tool.
    prun -N 1 -n 1 pixie -display ./yourprog > pixie.out
    
    The file pixie.out will contain timing data for each statement in your code. The pixie line number is the same as your source code line number.

If you want to generate output at this level of granularity for only a section of your code you can use the -sigdump option to pixie to reset the collection of timing data. See the pixie man page for more details on using -sigdump.

Memory usage

If your code uses all of the memory available on a node to user codes and also has to use virtual memory, then, in this situation, the use of virtual memory can slow your program down by a factor of ten. Your code has about 3.5 Gbytes available to it per node. For the best performance we recommend that your code use no more than 3.0 Gbytes of main memory per node and that you never use virtual memory.

To measure how much memory your code is using per node you can use the system routine to execute the command "ps xl" on each of your nodes. For example, in Fortran the call would be

call system("ps xl")
Only one processor per node needs to execute this command since it will report on the memory used by all of the processors on a node.

This command will report on the virtual memory and resident memory used by each processor. The sum of the resident memory values is the number that you want to keep under 3.0 Gbytes. The OS is very aggressive about paging out memory pages that are not in use, so the virtual memory size will often be much larger than the resident memory size.

One cause of using too much memory can be a memory leak in your code. The TotalView debugger can be used to detect memory leaks. The procedure to follow to do this is outlined at the TotalView web site.

Two rails

Your program can use either one or two communication rails by using the -l rmsrails option to qsub. The default is to use one rail. Using two rails my improve the performance of your code.

Three out of four processors

If the time between synchronizations in your code is small--250 milliseconds or less--then your code may perform better if you use three out of four processors per node rather than all four processors. This will free up the fourth processor for the system to handle system events. To use three processors per node set your value for -n in your prun to three times the value you specified for -N.

IO

We cannot yet make a general statement about whether your code should use $LOCAL or $SCRATCH for IO. We are continually working to improve both options. You should try both options, if possible, and see which option works better for you.

As much of your IO as possible should be unformatted IO. When using $SCRATCH you should use as large a buffersize as possible for your IO. Make your IO statements read and write as much data as possible and use the default values for blocksize in your open statements.

If you use $LOCAL and tcscp you should use the {rank} feature of tcscp.

MPI optimization

To make your code perform better you should use non-blocking MPI sends and receives. You should also post your receives prior to the corresponding sends.

Scalability

Another important factor in your code's performance is how well it scales. To determine how well it scales use the /bin/time command to time your pruns using increasing numbers of processors, while keeping the amount of work done per processor constant. As you increase the number of processors your execution time should decrease in a nearly linear fashion if your code scales well.

If your code does not scale well then an examination of the timings of your MPI calls can help you determine what the problem with your code is.

To collect timing data on your MPI calls follow the steps below.

  1. Compile your program.
    f90 -fast -o yourprog yourprog.f90 /usr/lib/libmpi.a -lelan -lelanctrl
    
  2. Generate a list of the routines in your code.
    nm -g yourprog | perl -e'while (<STDIN>) {@fields = split(" ", $_); \
    if ($fields[4] eq "T") {print "$fields[0]\n";}}' > routines
    
  3. Edit file routines to put the name of your main program first. With a C program or a Fortran program without a program statement the main program will be called 'main'. If a Fortran program has a program statement the main program name will be the name in the program statement with an '_' appended to it. You must insert the file /usr/local/packages/Atom/mpicalls.t into file routines in order to collect timing data on your MPI calls. You can also delete from file routines any routines for which you do not want to collect timing data.
  4. Load the atom tools module.
    module load atom
    
  5. Instrument your code so that it will generate timings.
    cat routines | atom -tool timer5 yourprog
    
    This will generate an executable named yourprog.timer5. You cannot generate an instrumented code on the scratch file systems.
  6. Run yourprog.timer5 on an increasing number for processors, while keeping the tprof.out.XXX.YYY files for each run.

You can evaluate the MPI timing data across your runs to help you determine what might be causing your code to scale poorly. If the MPI_Barrier time increases as the number of processors increases, you might have a load imbalance problem. You should redistribute the work across your nodes. If the MPI_Send, MPI_Recv or MPI_Reduce times increase as the number of processors increases, try to restructure your code to reduce the amount of communication between nodes.

Scaling Advantage Program

If you would like to optimize your code so that it can run on at least 1000 processors, you can get optimization assistance from PSC through the PSC's Scaling Advantage Program. This program includes consulting assistance from PSC, special queue handling if necessary, and service unit discounts, all of which are designed to enable you to scale your code to run on at least 1000 processors as quickly as possible. Send email to remarks@psc.edu if you would like to participate in this program.

Reporting a Problem

You have several options for reporting problems on lemieux.

  • You can call the User Services Hotline at 1-800-221-1641 from 9:00 a.m. until 8:00 p.m., Eastern time, on weekdays, and 9:00 a.m. until 4:00 p.m., Eastern time, on Saturdays.
  • You can send email to remarks@psc.edu.

Other Documentation

  • There are man pages for most system commands, including qsub, qstat, qdel and qalter. The pbs man page provides an overview of PBS. There are also man pages for the MPI routines. You must use the standard capitalization for the MPI routine when you use the man command.
  • The home page for HP is http://www.hp.com. Documentation specific to Tru64 can be found at http://h30097.www3.hp.com.

-->

Acknowledgement in Publications

PSC requests that a copy of any publication (preprint or reprint) resulting from research done on PSC systems be sent to the PSC Allocation Coordinator

Publications resulting from work done on lemieux should include a credit similar to

The computations were performed on the National Science Foundation Terascale Computing System at the Pittsburgh Supercomputing Center.