Accessing the shark cluster
Shark consists of 25 nodes: the front-end node shark.cs.uh.edu and the 24 compute nodes {shark01, shark02, ..., shark29}.cs.uh.edu. While all compute nodes have a public IP address, they are not accessible from the outside world. The cluster is controlled by a batch scheduler called SLURM, which runs on the front-end node. Thus, any user who wants to access the compute nodes of the shark cluster needs an account on the front-end node as well. If you would like to get an account, please contact gabriel [at] cs.uh.edu. The only way to access shark is via ssh.
All compute nodes share a common home file system. Users can move files to shark without having to allocate a compute partition. In contrast to the previous configuration of shark, which used salmon as the front-end node, it is possible (and recommended) to compile on shark. Please follow the instructions on the "how to compile an application for shark" webpage.
In order to log in to the shark compute nodes, you first have to log in on the front-end node and reserve a partition on the shark cluster. Depending on the type of job you would like to execute, you have to allocate an interactive job or a batch job. For a detailed description of the SLURM commands, please visit the SLURM documentation. The following lists the most common usage scenarios:
- Allocate 4 nodes for an interactive job: first, request 4 nodes on shark using the salloc command. For an interactive job, the batch scheduler either provides you immediately with the requested number of nodes, or salloc returns with an error message. Next, use the squeue command to verify which nodes have been allocated for your job. In the example shown below, the nodes shark01-shark04 have been allocated for this job. You can now log in on these four nodes using ssh. The following shows the list of commands and the output by SLURM:
    smith@salmon:~>salloc -N 4 bash
    salloc: Granted job allocation 489
    smith@salmon:~>squeue
    JOBID PARTITION NAME       USER   ST  TIME  NODES NODELIST(REASON)
    489   calc                 smith  R   0:02  4     shark[01-04]
    smith@salmon:~>ssh shark01
- Allocate 4 processors (= 2 nodes on shark) for an interactive job:
    smith@salmon:~>salloc -n 4 bash
    salloc: Granted job allocation 490
    smith@salmon:~>squeue
    JOBID PARTITION NAME       USER   ST  TIME  NODES NODELIST(REASON)
    490   calc                 smith  R   0:01  2     shark[01-02]
    smith@salmon:~>ssh shark01
- Once you are done with your interactive job, you have to exit from all nodes on shark. Additionally, you have to end the shell opened by the salloc command, e.g. by cancelling the job with the scancel command:
    smith@salmon:~>squeue
    JOBID PARTITION NAME       USER   ST  TIME  NODES NODELIST(REASON)
    489   calc                 smith  R   0:01  4     shark[01-04]
    smith@salmon:~>scancel 489
    smith@salmon:~>squeue
    JOBID PARTITION NAME       USER   ST  TIME  NODES NODELIST(REASON)
    smith@salmon:~>
- Allocate 4 nodes for an interactive job for 30 min. Please note that after the requested amount of time, SLURM will kill all application processes belonging to this job, including all open ssh connections.
    smith@salmon:~>salloc -N 4 -t 30 bash
    salloc: Granted job allocation 491
- Allocate 4 nodes for an interactive job excluding node shark01:
    smith@salmon:~>salloc -N 4 -x shark01 bash
    salloc: Granted job allocation 492
    smith@salmon:~>squeue
    JOBID PARTITION NAME       USER   ST  TIME  NODES NODELIST(REASON)
    492   calc                 smith  R   0:01  4     shark[02-05]
- Allocate 4 nodes for an interactive job, requesting the nodes shark19-shark22:
    smith@salmon:~>salloc -N 4 -w shark[19-22] bash
    salloc: Granted job allocation 493
    smith@salmon:~>squeue
    JOBID PARTITION NAME       USER   ST  TIME  NODES NODELIST(REASON)
    493   calc                 smith  R   0:01  4     shark[19-22]
- Allocate 4 nodes for a batch job: if you would like to execute a long-running job overnight, you should submit your job to the batch queue. SLURM will run the job as soon as the required number of nodes is available. In the example shown below, a batch script called run-job.sh is submitted to the scheduler. The output of the job is written to the directory from which you submitted the job, in a file called slurm-{jobid}.out.
    smith@salmon:~>sbatch -N 4 ./run-job.sh
    sbatch: Submitted batch job 494
    smith@salmon:~>squeue
    JOBID PARTITION NAME       USER   ST  TIME  NODES NODELIST(REASON)
    494   calc      run-job.sh smith  R   0:01  4     shark[01-04]
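The contents of run-job.sh are not shown on this page. As a minimal sketch only, the snippet below writes a hypothetical run-job.sh that carries its resource requests as #SBATCH directives, so that the node count and time limit do not have to be repeated on the sbatch command line. The script body (printing the hostname of every task) is an assumption chosen for illustration; replace it with your own application.

```shell
# Write a hypothetical run-job.sh; the contents are an illustrative assumption,
# not the script used on shark.
cat > run-job.sh <<'EOF'
#!/bin/bash
# Request 4 nodes (equivalent to "sbatch -N 4").
#SBATCH -N 4
# Time limit of 30 minutes (equivalent to "sbatch -t 30").
#SBATCH -t 30

# Print the task-labelled hostname of every task in the allocation.
srun -l /bin/hostname | sort -n
EOF
chmod +x run-job.sh

# Submit from the front-end node (not executed here):
#   sbatch ./run-job.sh
```

Options given on the sbatch command line take precedence over the #SBATCH lines inside the script, so the same script can be resubmitted with a different node count or time limit.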
When starting a parallel MPI job using Open MPI, please note that you can start the mpirun command from the very same window without any additional information. Open MPI will figure out all the details for you. Please see also the following entry on the Open MPI webpage. If you want to start the MPI job from a different window than the one where you made the salloc allocation, or if you would like to use MPICH/MVAPICH, the following script shows the typical approach:
    #!/bin/bash
    cd workdir
    srun -l /bin/hostname | sort -n | awk '{print $2}' > hostf
    mpirun -np 4 -hostfile hostf ./myexec
    exit
You can also use all options shown in the interactive job section, e.g. for time-limit, requesting or excluding specific nodes.
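The hostfile step in the script above can be tried without an allocation. The sketch below wraps the same sort and awk pipeline in a function (the name hostfile_from_srun is hypothetical) and feeds it made-up text in the "rank: hostname" form that srun -l prints; the sample lines are an assumption for illustration, not real output from shark.

```shell
# Turn "srun -l /bin/hostname" output (lines of the form "rank: hostname")
# into hostfile lines, one hostname per task, ordered by task rank.
hostfile_from_srun() {
    sort -n | awk '{print $2}'
}

# Made-up sample: 4 tasks on 2 nodes, arriving in arbitrary order.
sample_output='2: shark02
0: shark01
3: shark02
1: shark01'

printf '%s\n' "$sample_output" | hostfile_from_srun
```

For the sample above this prints shark01 twice followed by shark02 twice. Sorting numerically on the task label before stripping it is what keeps the hostfile in rank order, so that mpirun places rank 0 on the first node listed.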
