Lisa FAQ

What does 'Lisa' mean?

We think the name 'Lisa' is appropriate for the system, because:

  • We like the name Lisa
  • The name is short and easy to type
  • 'Lisa' is easily understandable

If one wants, 'Lisa' can stand for:

  • Lisa Supercomputer Almere
  • Linux Supercomputer Almere

The first one honors the fact that large, essential portions of the software that make systems like Lisa possible come from the open source community: GNU. GNU stands for "GNU's Not Unix". The second honors the fact that the operating system on Lisa is Linux. Without the availability of free, open source operating systems like Linux, a cluster like Lisa would be nearly impossible.

Who can use the national compute cluster at SURFsara?

This question is best answered here: Information for new users.

I want to acknowledge SURFsara for the usage of Lisa and the support I got.

We would appreciate it if you included a text like this in publications about projects in which Lisa played a role:

We thank SURFsara (www.surfsara.nl) for the support in using the Lisa Compute Cluster.

What is AMD64?

See this page.

Will my program run parallel automatically?

No, when you run a normal program on the cluster, it will not automatically run in parallel. Two things are required:

  • Parallelize the program; for example, look here for parallelization software.
  • Run the program in a parallel job; see batch.

When should I specify more than one node (-lnodes=4)?

Requesting more than one node is only meaningful for a parallel job (see above). Serial programs do not benefit. On the contrary: they will use only one core on one node, leaving the rest of the cores unused. Here is an example of how to use all cores in a job.

Why are nodes, rather than processors or cores, allocated to a job?

A node on the Lisa cluster contains eight or more cores. It would have been possible to allocate single cores to jobs, so that eight serial jobs could run on the same node. We chose not to do so, but to allocate whole nodes to a job, for the following reasons:

  • We think it is important that processes belonging to different jobs interfere as little as possible. A process on a node will always hinder another process on the same node in some way, because they share memory, scratch space, access to memory (the memory bus) and the network interface.
  • Before a job starts, we want to clean its environment as much as possible: a pre-job script removes unwanted processes, cleans scratch space and removes shared memory segments. It is difficult to perform these tasks when another job is still running on that node without hindering that job.
  • We want to offer the opportunity to use all of the available memory for one process.
  • Most users tend to submit large numbers of comparable jobs. It is not difficult to run more than one process in one job, see this example.

Can I log in to a node where my job runs?

You can log in to a node on which your job is running using the command pbs_joblogin, but it may be sufficient to monitor the node with pbs_jobmonitor.

Disk quota

Every user on the Lisa system has their own file system, normally 200 Gbyte in size. To get an impression of how much is used, issue the following command:

  quota

You will get an output like this:

   willem@login4: ~/:=-> quota -h
   Filesystem            Size  Used Avail Use% Mounted on
   fs7:/lisapool/home/willem
                         75G   69G   6G  92% /home/willem

or, depending on the type of disks on which your home directory resides:

   willem@login4: ~/:=-> quota
   Disk quotas for user willem (uid 31009): 
     Filesystem   space   quota   limit   grace   files   quota   limit   grace
   fs8:/lisapool/home/willem
                 69011M  76800M  76800M           21973       0       0        

This means that user 'willem' has a total disk space of 75 Gbyte, of which 69 Gbyte is used and 6 Gbyte is free; the usage percentage is 92.

Help, my disk quota is used up, and I cannot remove files!

If you have really used up all your disk quota, there is not much you can do. Even an attempt to remove some files:

  rm myfile 

does not work, because 'rm' temporarily needs some file space.

What to do? Removing a file is not possible, but it is possible to change the size of a file to zero bytes. Here is the very short command to change the size of 'myfile' to zero:

  > myfile 

When you have resized a few large files this way, you can proceed to clean up your home directory using the 'rm' command again.
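
The recovery steps above can be tried out safely in a scratch directory. The following is a minimal sketch, assuming bash; the temporary directory and its copy of 'myfile' are created only for this demonstration. It uses ': > myfile' rather than the bare '> myfile': both truncate the file in bash, but the bare form may wait for input in some other shells, such as zsh.

```shell
#!/bin/bash
# Demonstration in a throw-away directory; nothing in your home directory is touched.
demo=$(mktemp -d)
big="$demo/myfile"

# Create a 1 MiB file to stand in for a large data file.
head -c 1048576 /dev/zero > "$big"
size_before=$(( $(wc -c < "$big") ))

# Truncate the file to zero bytes; the file itself stays in place.
: > "$big"
size_after=$(( $(wc -c < "$big") ))

echo "before: $size_before bytes, after: $size_after bytes"

# Once space has been freed this way, a normal 'rm' works again.
rm "$big"
```

Note that truncating destroys the file's contents just as 'rm' would, so only do this to files you intended to delete anyway.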

Can I use two or more cores in one job?

It is perfectly possible to use two or more cores in a job by starting two or more processes. The operating system will take care that the processes are spread over the cores. An example is here.
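
A minimal sketch of this pattern in bash follows. The function 'work' is a hypothetical stand-in for your own serial program, the task count of 4 is an assumption (in a real job you would match it to the number of cores on the node), and the temporary directory stands in for a real output location such as $TMPDIR.

```shell
#!/bin/bash
# Sketch: start several independent processes in one job and wait for all of them.

outdir=$(mktemp -d)      # stand-in for a real output directory
ntasks=4                 # assumption: match this to the cores on your node

# Hypothetical stand-in for a serial program; writes one result file per task.
work() {
    echo "task $1 done" > "$outdir/result.$1"
}

for i in $(seq 1 "$ntasks"); do
    work "$i" &          # '&' starts the process in the background
done

wait                     # block until every background process has finished

echo "all $ntasks tasks finished; results are in $outdir"
```

The 'wait' at the end matters: without it the job script would exit, and the job would end, while the background processes are still running.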

What can I do to optimize the turn-around time of a job?

Lisa's job scheduler uses a first-in first-out strategy, complemented with backfilling and a fair share algorithm. This ensures that:

  • a job is guaranteed to run at some time
  • nodes that are temporarily idle because of the FIFO strategy will get filled with jobs that fit in the empty time slots
  • the system resources are divided evenly among users and user groups

To get your job running as soon as possible, specify a wall clock time that is as short as possible. The shorter the requested runtime of a job, the greater the chance that it is eligible for backfilling.
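
One way to choose a tight but safe value is to time a representative test run and add a margin. The sketch below assumes a hypothetical measured runtime of 1500 seconds and a 20% margin; the helper 'to_walltime' is made up for illustration and only formats seconds as hh:mm:ss for the -lwalltime option.

```shell
#!/bin/bash
# Sketch: derive a walltime request from a measured runtime plus a safety margin.

# Hypothetical helper: print a number of seconds as hh:mm:ss.
to_walltime() {
    local s=$1
    printf '%02d:%02d:%02d\n' $((s / 3600)) $((s % 3600 / 60)) $((s % 60))
}

measured=1500            # assumption: a test run took 1500 seconds
margin_percent=20        # assumption: 20% headroom for slower runs

requested=$(( measured + measured * margin_percent / 100 ))
walltime=$(to_walltime "$requested")

# Line to put at the top of the job script.
echo "#PBS -lwalltime=$walltime"
```

A margin that is too small risks the job being killed at the limit, so err on the generous side for jobs whose runtime varies.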

Tell me more about serial and parallel jobs

Serial and parallel jobs are defined as:

  • serial job: uses one node
  • parallel job: uses more than one node

Note: it is possible to run a parallel program in a serial job: the processes will all run on the same node.

The system consists of nodes with InfiniBand and nodes without InfiniBand. InfiniBand is a high-speed, low-latency network, intended especially for parallel programs.

The system will schedule parallel jobs on InfiniBand nodes, and serial jobs on non-InfiniBand nodes. The decision is based on the number of nodes a job requests:

  • one node -> non-InfiniBand
  • more than one node -> InfiniBand

It seems that my job was run twice!

See To rerun or not to rerun

How can I determine if my script is running in the batch or interactively?

See Interactive or batch?
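
One common approach on Torque/PBS systems is to inspect the PBS_ENVIRONMENT variable, which the batch system sets to PBS_BATCH for batch jobs and to PBS_INTERACTIVE for interactive jobs. The sketch below assumes that Lisa's batch system sets this variable in the same way; the function name 'job_mode' is made up for illustration, and the page referred to above may describe a different method.

```shell
#!/bin/bash
# Sketch: report how the current script was started, based on PBS_ENVIRONMENT.
# Assumption: the batch system sets PBS_ENVIRONMENT as Torque/PBS normally does.

job_mode() {
    case "${PBS_ENVIRONMENT:-}" in
        PBS_BATCH)       echo "batch" ;;        # started through a batch job
        PBS_INTERACTIVE) echo "interactive" ;;  # started in an interactive job
        *)               echo "no-job" ;;       # variable unset: not inside a job
    esac
}

echo "running in mode: $(job_mode)"
```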

I need to see the output of my batch job immediately while executing to see if my program crashed but the output seems cut off!

You can arrange that output for a specific file, or all open files, is flushed to disk in your program:

  • C (from man fflush):
#include <stdio.h>
FILE *stream;
...
fflush(stream);   // to flush file 'stream'
fflush(NULL);     // to flush all open output files
  • Gfortran and Intel Fortran:
integer unit
...
call flush(unit)  ! to flush file nr unit
call flush(0)     ! to flush all open files

You can also arrange that output is flushed directly after each write:

  • C:
#include <stdio.h>
...
setvbuf(stdout, NULL, _IONBF, 0);
  • Gfortran: set the environment variable 'GFORTRAN_UNBUFFERED_ALL' before running your program to 'y':
export GFORTRAN_UNBUFFERED_ALL=y
  • Intel Fortran: the default behavior is to flush after each write, so you do not have to do anything. Buffering is controlled with the environment variable FORT_BUFFERED:
export FORT_BUFFERED=n

I need to run very long jobs

If you specify a wall clock time that is too long, the system rejects your job. It is often possible to checkpoint your programs just before the end of the allotted wall clock time, and restart them in another job. See our description of the DMTCP package.
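
DMTCP checkpoints a program from the outside. Where a program can save its own progress, the same idea of splitting work over several jobs can be sketched as below. This is an application-level illustration, not the DMTCP mechanism; the state file, the function 'run_chunk' and the step counts are all hypothetical.

```shell
#!/bin/bash
# Sketch: a long computation split over several "jobs" using a state file.

state=$(mktemp)          # stand-in for a checkpoint file in your home directory
echo 0 > "$state"        # no steps completed yet

# Resume from the saved step, do at most $1 steps out of $2 in total,
# and save progress after every step. Succeeds once all steps are done.
run_chunk() {
    local budget=$1 total=$2 step
    step=$(cat "$state")
    while [ "$step" -lt "$total" ] && [ "$budget" -gt 0 ]; do
        step=$((step + 1))        # one unit of real work would go here
        budget=$((budget - 1))
        echo "$step" > "$state"   # checkpoint
    done
    [ "$step" -ge "$total" ]
}

# Each call stands in for one batch job that fits within the walltime limit.
jobs_needed=1
until run_chunk 4 10; do
    jobs_needed=$((jobs_needed + 1))
done

echo "finished after $jobs_needed jobs"
```

In practice each call to 'run_chunk' would be a separate submitted job, with the step budget chosen so that the job finishes comfortably within its walltime.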

I want to quickly test my jobs

To facilitate quick testing of short jobs, we dedicate a few 8-core nodes to jobs that request no more than 5 minutes of wall clock time. You can submit as many of these jobs as you want, but only one job per user will run at a time. Example test job (the most important part is the #PBS line):

#PBS -lnodes=1:cores8:ppn=8 -lwalltime=5:00
module load openmpi/gnu
cd $HOME/workdir
mpiexec ./my-mpi-program

Only one job is running at a time

If you submitted many jobs and they are running one by one, you may have specified a walltime of 5 minutes or less. These very short jobs are submitted to a special queue and have a high chance of running very quickly, but only one at a time per user. So, if you have many short jobs, specify a walltime larger than 5 minutes, for example:

#PBS -lnodes=1:cores8:ppn=8 -lwalltime=6:00
module load openmpi/gnu
cd $HOME/workdir
mpiexec ./my-mpi-program

My job hits the walltime limit, how to save my files?

As explained in the description of the file systems, we urge you to read and produce files in $TMPDIR (or /scratch). Problems can arise if your job hits the walltime limit: how do you save the output files? A solution is presented in the description of the module sara-batch-resources.
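
Independently of that module, the general idea can be sketched in plain bash: stop the program shortly before the walltime limit and copy the files out of scratch while the job is still alive. This is a generic illustration, not the sara-batch-resources mechanism; the directories, the 1-second time budget and the stand-in program are all hypothetical.

```shell
#!/bin/bash
# Sketch: stop a program before the walltime limit and rescue its scratch files.

scratch=$(mktemp -d)     # stand-in for "$TMPDIR"
dest=$(mktemp -d)        # stand-in for a directory in your home file system
budget=1                 # seconds; in a real job, somewhat below the walltime

# Hypothetical long-running program: writes partial output, then keeps running.
( cd "$scratch" && echo "partial result" > output.dat && exec sleep 30 ) &
pid=$!

# Watchdog: terminate the program once the time budget is used up.
( sleep "$budget"; kill "$pid" 2>/dev/null ) &
watchdog=$!

wait "$pid"                    # returns when the program ends or is terminated
kill "$watchdog" 2>/dev/null   # stop the watchdog if the program finished early

# The job is still running here, so there is time to copy the results.
cp "$scratch"/* "$dest"/
echo "saved: $(ls "$dest")"
```

The budget has to leave enough time before the walltime limit for the copy itself, which can be significant for large output files.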

Maintenance, so what?

A few times per year, the 'message of the day' (the message you get when you log in to Lisa) will announce that maintenance is planned. During this period the system will be upgraded or adapted.

Consequences for you:

  • During maintenance, you cannot log in
  • Jobs that would still be running at the start of the maintenance will not be started

Can I receive mail on my Lisa login?

No, you can't receive messages from outside the Lisa system. The batch nodes can send mail to your login, but in order to read it, you have to forward mail sent to your Lisa login.