JEDI Modules on selected HPC systems

If you are running JEDI on a personal computer (Mac, Windows, or Linux) we recommend that you use either the JEDI Singularity container or the JEDI Charliecloud container. These provide all of the necessary software libraries for you to build and run JEDI.

If you are running JEDI on an HPC system, Charliecloud is still a viable option. However, on selected HPC systems that are accessed by multiple JEDI users we offer another option, namely JEDI Modules.

Environment modules are implemented on most HPC systems and are an easy and effective way to manage software libraries. Most implementations share similar commands, such as:

module list # list modules you currently have loaded
module spider <string> # list all modules that contain <string>
module avail # list modules that are compatible with the modules you already have loaded
module load <package1> <package2> <...> # load specified packages
module unload <package1> <package2> <...> # unload specified packages
module swap <packageA> <packageB> # swap one module for another
module purge # unload all modules

For further information (and more commands) you can refer to a specific implementation such as Lmod.
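For illustration, a typical session for locating and loading a JEDI module might look like the following (the module name shown is just an example; see the system-specific sections below for the actual names and for any module use commands that must come first):

module spider jedi           # find available modules matching "jedi"
module load jedi/gnu-openmpi # load the one you want
module list                  # verify what is now loaded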

We currently offer JEDI modules on several HPC systems, as described below. Consult the appropriate section for instructions on how to access the JEDI modules on each system.

These modules are functionally equivalent to the JEDI Singularity and Charliecloud containers in the sense that they provide all of the software libraries necessary to build and run JEDI. But there is no need to install a container provider or to enter a different mount namespace. After loading the appropriate JEDI module or modules (some bundles may require loading more than one), users can proceed to compile and run the JEDI bundle of their choice.
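As a rough sketch of that workflow (the module and bundle names are placeholders; the exact commands for each system are given in the sections below):

module load jedi/<compiler-mpi>                   # system-specific JEDI module
git clone https://github.com/jcsda/<jedi-bundle>  # bundle of your choice
mkdir -p jedi/build; cd jedi/build
ecbuild <path-to-bundle>
make -j4
ctest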

General Tips for HPC Systems

Many HPC systems do not allow you to run MPI jobs from the login nodes. So, after building JEDI, you'll have to run the tests either in batch mode, through a job scheduler such as slurm via sbatch directives, or interactively by accessing a compute node through a command such as salloc. Often these compute nodes do not have access to the internet, so after you build JEDI you may need to run the following command from a login node:

ctest -R get_

This runs several tests. The purpose of these tests is to download data files from the cloud that are then used by many of the other tests. If the get_* tests are successful, then the data was downloaded successfully and you can proceed to run the remainder of the tests in batch using sbatch, salloc, or the equivalent process management command on your system.
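In other words, the tests are typically split into two steps (a sketch; run each from your build directory):

ctest -R get_   # on a login node: download the test data from the cloud
ctest -E get_   # on a compute node (batch or interactive): run everything else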

Hera

Hera is an HPC system located in NOAA’s NESCC facility in Fairmont, WV. The following bash shell commands are necessary to access the installed JEDI modules:

export JEDI_OPT=/scratch1/NCEPDEV/jcsda/jedipara/opt/modules
module use $JEDI_OPT/modulefiles/core

If you use tcsh, use these commands:

setenv JEDI_OPT /scratch1/NCEPDEV/jcsda/jedipara/opt/modules
module use $JEDI_OPT/modulefiles/core

If you wish to use the intel compiler suite, the preferred jedi modules are those from 2020.2:

module purge
module load jedi/intel-impi/2020.2

If you wish to use the gnu compiler suite with the openmpi library, enter:

module purge
module load jedi/gnu-openmpi

We also maintain modules for version 18 of the intel compilers and mpi libraries, though these are not required. To use the intel 18 modules, enter the following commands in addition to the corresponding JEDI_OPT commands described above:

# if you use tcsh, use the setenv syntax shown above
export JEDI_OPT2=/home/role.jedipara/opt/modules
module use $JEDI_OPT2/modulefiles/core
module purge
module load jedi/intel-impi/18

It is important to note that the JEDI modules may conflict with other modules provided by other developers on Hera, particularly for installations of HDF5 and NetCDF. The Hera sysadmins have provided their own builds of HDF5 and NetCDF (in /apps/modules/modulefamilies/intel) and netcdf-hdf5parallel (in /apps/modules/modulefamilies/intel_impi). Unfortunately, these libraries have incompatible versions and compile-time options that conflict with the JEDI components. For a JEDI-related project, use our modules. If modules have been mixed, you can unload all modules and start over with module purge.
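For example, if module list shows hdf5 or netcdf modules from /apps alongside the JEDI stack, a clean recovery (a minimal sketch) is:

module list                          # check for hdf5/netcdf modules from /apps
module purge                         # unload everything, including the conflicting builds
module load jedi/intel-impi/2020.2   # reload only the JEDI module you need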

Also, it is recommended that you specify srun as your mpi process manager when building, like so:

ecbuild -DMPIEXEC_EXECUTABLE=`which srun` -DMPIEXEC_NUMPROC_FLAG="-n" <path-to-bundle>
make -j4

To run tests with slurm and srun, you also need to have the following environment variables defined:

export SLURM_ACCOUNT=<account you can run slurm jobs under>
export SALLOC_ACCOUNT=$SLURM_ACCOUNT
export SBATCH_ACCOUNT=$SLURM_ACCOUNT
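With these variables set, you can, for example, run the MPI-dependent tests interactively on a compute node (a sketch; adjust the task count and wall time to your needs):

salloc --ntasks=6 --time=30          # request an interactive allocation
cd <path-to-bundle-build-directory>
ctest -E get_                        # run everything except the get_ tests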

Orion

Orion is an HPC system located at Mississippi State University for the purpose of furthering NOAA’s scientific research and collaboration.

The following bash shell commands are necessary to access the installed jedi modules (substitute the equivalent csh commands as appropriate):

export JEDI_OPT=/work/noaa/da/jedipara/opt/modules
module use $JEDI_OPT/modulefiles/core

Currently there are two sets of compiler / MPI module suites available to load (choose only one):

Intel Parallel Studio version 2020 update 2 (which contains version 20.2.254 of the compiler suite):

module load jedi/intel-impi # Intel compiler suite with intel MPI

and version 10.2.0 of the GNU compiler suite with OpenMPI v4.0.4:

module load jedi/gnu-openmpi # GNU compilers with OpenMPI

Orion uses the slurm task manager for parallel mpi jobs. Though some slurm implementations allow you to use the usual mpi launch commands mpirun or mpiexec, these may not function properly on Orion. Instead, you are advised to use the slurm launch command srun; an appropriate jedi-cmake toolchain is available to set this up.

First, clone the jedi-cmake repository:

git clone [email protected]:jcsda/jedi-cmake.git

Then pass the following toolchain to ecbuild, and use multiple threads to speed up the compilation:

git clone https://github.com/jcsda/<jedi-bundle>
mkdir -p jedi/build; cd jedi/build
ecbuild --toolchain=<path-to-jedi-cmake>/jedi-cmake/cmake/Toolchains/jcsda-Orion-Intel.cmake <path-to-bundle>
make -j4

Note

If you cloned the jedi-cmake repository as part of building a jedi bundle, then the name of the repository may be jedicmake instead of jedi-cmake.

Alternatively, you can specify the MPI executable directly on the command line:

ecbuild -DMPIEXEC_EXECUTABLE=/opt/slurm/bin/srun -DMPIEXEC_NUMPROC_FLAG="-n" <path-to-bundle>
make -j4

Note that specifying srun as the MPI executable is really only necessary for the ctests. If you run an application directly (outside of ctest), you may simply launch it with srun.
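For example, to launch one of the test executables directly from the build directory (a sketch; the executable and YAML file are the same example used in the batch script further below):

srun --ntasks=4 test_ufo_radiosonde_opr testinput/radiosonde.yaml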

Here is a sample slurm batch script for running ctest. Note that you will need to add appropriate #SBATCH directives for specifying a computing account, quality of service, job partition, and so on; please consult the Orion Usage and Guidelines documentation.

#!/usr/bin/bash
#SBATCH --job-name=<name>
#SBATCH --nodes=1
#SBATCH --account <account>
#SBATCH --partition <partition>
#SBATCH --qos <qos>
#SBATCH --time=0:10:00
#SBATCH --mail-user=<email-address>

source /etc/bashrc
module purge
export JEDI_OPT=/work/noaa/da/jedipara/opt/modules
module use $JEDI_OPT/modulefiles/core
module load jedi/intel-impi
module list
ulimit -s unlimited
ulimit -v unlimited

export SLURM_EXPORT_ENV=ALL
export HDF5_USE_FILE_LOCKING=FALSE

cd <path-to-bundle-build-directory>
ctest -E get_

exit 0

Note that the options specified with #SBATCH include the number of nodes but not the number of tasks needed. This is most appropriate for running ctest because some tests require a different number of MPI tasks than others. However, if you run an application individually, you should specify #SBATCH --ntasks=<number> instead of #SBATCH --nodes=<number>, as shown in the following example; the slurm job scheduler will then determine how many nodes your job requires. Specifying --ntasks instead of --nodes in the #SBATCH header ensures that your computing allocation is only charged for what you use, which is preferable for more computationally intensive jobs:

#!/usr/bin/bash
#SBATCH --job-name=<name>
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --time=0:10:00
#SBATCH --mail-user=<email-address>

source /etc/bashrc
module purge
export JEDI_OPT=/work/noaa/da/jedipara/opt/modules
module use $JEDI_OPT/modulefiles/core
module load jedi/gnu-openmpi
module list
ulimit -s unlimited
ulimit -v unlimited

export SLURM_EXPORT_ENV=ALL
export HDF5_USE_FILE_LOCKING=FALSE

# make sure the number of tasks the application requires matches the SBATCH --ntasks specification above
cd <path-to-bundle-build-directory>
srun --ntasks=4 --cpu_bind=core --distribution=block:block test_ufo_radiosonde_opr testinput/radiosonde.yaml

exit 0

Submit and monitor your jobs with these commands:

sbatch <batch-script>
squeue -u <your-user-name>

You can delete jobs with the scancel command. For further information please consult the Orion Cluster Computing Basics documentation.
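For example, to cancel a specific job or all of your jobs:

scancel <job-id>              # cancel a single job (ID from squeue)
scancel -u <your-user-name>   # cancel all of your jobs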

Cheyenne

Cheyenne is a 5.34-petaflops, high-performance computer built for NCAR by SGI. On Cheyenne, users can access the installed jedi modules by first entering

module purge
export JEDI_OPT=/glade/work/jedipara/cheyenne/opt/modules
module use $JEDI_OPT/modulefiles/core

Current options for setting up the JEDI environment include (choose only one):

module load jedi/gnu-openmpi # GNU compiler suite and openmpi
module load jedi/intel-impi # Intel 19.0.5 compiler suite and Intel mpi

Because of space limitations on your home directory, it’s a good idea to locate your build directory on the glade filesystems:

cd /glade/work/<username>
mkdir jedi/build; cd jedi/build

If you choose the jedi/gnu-openmpi module, you can proceed to run ecbuild as you would on most other systems:

ecbuild <path-to-bundle>
make update
make -j4

Warning

Please do not use too many threads to speed up the compilation; the Cheyenne system administrators may terminate your processes on the login node.

However, if you choose to compile with the jedi/intel-impi module you must use a toolchain. This is required in order to enable C++14 and to link to the proper supporting libraries.

First clone the jedi-cmake repository:

git clone [email protected]:jcsda/jedi-cmake.git

Then pass this toolchain to ecbuild:

ecbuild --toolchain=<path-to-jedi-cmake>/jedi-cmake/cmake/Toolchains/jcsda-Cheyenne-Intel.cmake <path-to-bundle>

Note

If you cloned the jedi-cmake repository as part of building a jedi bundle, then the name of the repository may be jedicmake instead of jedi-cmake.

The system configuration on Cheyenne will not allow you to run mpi jobs from the login node. If you try to run ctest from here, the mpi tests will fail. To run the jedi unit tests you will have to either submit a batch job or request an interactive session with qsub -I. The following is a sample batch script to run the unit tests for ufo-bundle. Note that some ctests require up to 6 MPI tasks so requesting 6 cores should be sufficient.

#!/bin/bash
#PBS -N ctest-ufo-gnu
#PBS -A <account-number>
#PBS -l walltime=00:20:00
#PBS -l select=1:ncpus=6:mpiprocs=6
#PBS -q regular
#PBS -j oe
#PBS -k eod
#PBS -m abe
#PBS -M <your-email>

source /etc/profile.d/modules.sh
module purge
export JEDI_OPT=/glade/work/jedipara/cheyenne/opt/modules
module use $JEDI_OPT/modulefiles/core
module load jedi/gnu-openmpi
module list

# cd to your build directory.  Make sure that these binaries were built
# with the same module that is loaded above, in this case jedi/gnu-openmpi

cd <build-directory>

# now run ctest
ctest -E get_
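Alternatively, if you prefer an interactive session, a sketch of the equivalent steps is (adjust the account, queue, and resources as needed):

qsub -I -A <account-number> -q regular -l walltime=00:20:00 -l select=1:ncpus=6:mpiprocs=6

# once the interactive session starts:
module purge
export JEDI_OPT=/glade/work/jedipara/cheyenne/opt/modules
module use $JEDI_OPT/modulefiles/core
module load jedi/gnu-openmpi
cd <build-directory>
ctest -E get_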

Casper

The Casper cluster is a heterogeneous system of specialized data analysis and visualization resources, large-memory, multi-GPU nodes, and high-throughput computing nodes. On Casper, users can access the installed jedi modules by first entering

module purge
export JEDI_OPT=/glade/work/jedipara/casper/opt/modules
module use $JEDI_OPT/modulefiles/core

Current options for setting up the JEDI environment include (choose only one):

module load jedi/gnu-openmpi # GNU compiler suite and openmpi
module load jedi/intel-impi # Intel 19.0.5 compiler suite and Intel mpi

Because of space limitations on your home directory, it’s a good idea to locate your build directory on the glade filesystems:

cd /glade/work/<username>
mkdir jedi/build; cd jedi/build

If you choose the jedi/gnu-openmpi module, you can proceed to run ecbuild as you would on most other systems:

ecbuild <path-to-bundle>
make update
make -j4

Warning

Please do not use too many threads to speed up the compilation; the Casper system administrators may terminate your processes on the login node.

However, if you choose to compile with the jedi/intel-impi module you must use a toolchain. This is required in order to enable C++14 and to link to the proper supporting libraries.

First clone the jedi-cmake repository:

git clone [email protected]:jcsda/jedi-cmake.git

Then pass this toolchain to ecbuild:

ecbuild --toolchain=<path-to-jedi-cmake>/jedi-cmake/cmake/Toolchains/jcsda-Casper-Intel.cmake <path-to-bundle>

Note

If you cloned the jedi-cmake repository as part of building a jedi bundle, then the name of the repository may be jedicmake instead of jedi-cmake.

The system configuration on Casper will not allow you to run mpi jobs from the login node. If you try to run ctest from here, the mpi tests will fail. To run the jedi unit tests you will have to either submit a batch job or request an interactive session with execcasper. Invoking it without an argument will start an interactive shell on the first available HTC node. The default wall-clock time is 6 hours. To use another type of node, include a select statement specifying the resources you need. The execcasper command accepts all PBS flags and resource specifications as detailed by man qsub.
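For example, to request an interactive node sized like the batch job below (a sketch; execcasper accepts the same PBS resource specifications):

execcasper -A <project-code> -l select=1:ncpus=6:mpiprocs=6 -l walltime=00:20:00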

The following is a sample batch script to run the unit tests for ufo-bundle. Note that some ctests require up to 6 MPI tasks so requesting 6 cores should be sufficient.

#!/bin/bash
#PBS -N ctest-ufo-gnu
#PBS -A <project-code>
#PBS -l walltime=00:20:00
#PBS -l select=1:ncpus=6:mpiprocs=6
#PBS -q casper
#PBS -j oe
#PBS -k eod
#PBS -m abe
#PBS -M <your-email>

source /etc/profile.d/modules.sh
module purge
export JEDI_OPT=/glade/work/jedipara/casper/opt/modules
module use $JEDI_OPT/modulefiles/core
module load jedi/gnu-openmpi
module list

# cd to your build directory.  Make sure that these binaries were built
# with the same module that is loaded above, in this case jedi/gnu-openmpi

cd <build-directory>

# now run ctest
ctest -E get_

Discover

Discover is a 90,000-core supercomputing cluster capable of delivering 3.5 petaflops of high-performance computing for Earth system applications, from weather to seasonal to climate predictions.

To access the jedi modules on Discover, it is recommended that you add this to your $HOME/.bashrc file (or the equivalent if you use another shell):

export JEDI_OPT=/discover/swdev/jcsda/modules
module use $JEDI_OPT/modulefiles/core
module use $JEDI_OPT/modulefiles/apps

Currently two stacks are maintained (choose only one):

module load jedi/intel-impi
module load jedi/gnu-impi

The second option may seem a little surprising because it pairs the gnu 9.2.0 compiler suite with the intel 19.1.0.166 mpi library. However, this is intentional: Intel MPI is currently the recommended MPI library on SLES-12 for both Intel and gnu compilers. Note that OpenMPI is not yet available on SLES-12, though hpcx, a proprietary variant of OpenMPI from Mellanox, is.

Each of these jedi modules defines the environment variable MPIEXEC, which points to the recommended MPI launch executable and should be specified explicitly when you build jedi:

ecbuild -DMPIEXEC_EXECUTABLE=$MPIEXEC -DMPIEXEC_NUMPROC_FLAG="-np" <path-to-bundle>

There is also another module that is built from the ESMA baselibs libraries. To use this, enter:

module purge
module load jedi/baselibs/intel-impi

Currently intel-impi/19.1.0.166 is the only baselibs option available, but more may be added in the future. As with the previous modules, specify the MPI executable explicitly when you build:

ecbuild -DMPIEXEC_EXECUTABLE=$MPIEXEC -DMPIEXEC_NUMPROC_FLAG="-np" <path-to-bundle>
make -j4

Whichever module you use, after building you will want to run the get tests from the login node to get the test data from AWS S3:

ctest -R get_

To run the remaining tests, particularly those that require MPI, you’ll need to acquire a compute node. You can do this interactively with

salloc --nodes=1 --time=30
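Once the allocation is granted, you can run the remaining tests from your build directory, for example:

cd <path-to-bundle-build-directory>
ctest -E get_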

Or, you can submit a batch script to the queue through sbatch as described in the S4 instructions below.

S4

S4 is the Satellite Simulations and Data Assimilation Studies supercomputer located at the University of Wisconsin-Madison’s Space Science and Engineering Center.

The S4 system currently only supports intel compilers. Furthermore, S4 uses the slurm task manager for parallel mpi jobs. Though some slurm implementations allow you to use the usual mpi launch commands mpirun or mpiexec, these may not function properly on S4. Instead, you are advised to use the slurm launch command srun.

To load the JEDI intel module you can use the following commands (as on other systems, you can put the first two lines in your ~/.bashrc file for convenience):

export JEDI_OPT=/data/prod/jedi/opt/modules
module use $JEDI_OPT/modulefiles/core
module load jedi/intel-impi

The recommended way to compile JEDI on S4 is to first clone the jedi-cmake repository, which contains an S4 toolchain:

git clone [email protected]:jcsda/jedi-cmake.git

Then pass this toolchain to ecbuild:

ecbuild --toolchain=<path-to-jedi-cmake>/jedi-cmake/cmake/Toolchains/jcsda-S4-Intel.cmake <path-to-bundle>

Note

If you cloned the jedi-cmake repository as part of building a jedi bundle, then the name of the repository may be jedicmake instead of jedi-cmake.

Alternatively, you can specify the MPI executable directly on the command line:

ecbuild -DMPIEXEC_EXECUTABLE=/usr/bin/srun -DMPIEXEC_NUMPROC_FLAG="-n" <path-to-bundle>
make -j4

Note that specifying srun as the MPI executable is only really necessary for the ctests. If you run an application directly (outside of ctest), you can simply launch it with srun.

Here is a sample slurm batch script for running ctest.

#!/usr/bin/bash
#SBATCH --job-name=<name>
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0:10:00
#SBATCH --mail-user=<email-address>

source /etc/bashrc
module purge
export JEDI_OPT=/data/prod/jedi/opt/modules
module use $JEDI_OPT/modulefiles/core
module load jedi/intel-impi
module list
ulimit -s unlimited

export SLURM_EXPORT_ENV=ALL
export HDF5_USE_FILE_LOCKING=FALSE

cd <path-to-bundle-build-directory>
ctest -E get_

exit 0

Note that the options specified with #SBATCH include the number of nodes but not the number of tasks needed. This is most appropriate for running ctest because some tests require a different number of MPI tasks than others. However, if you run an application individually, you should specify #SBATCH --ntasks=<number> instead of #SBATCH --nodes=<number>, as shown in the following example; the slurm job scheduler will then determine how many nodes you need. For example, if you are running with the ivy partition as shown here, each node has 20 cpu cores, so if your application takes more than 20 MPI tasks, slurm will allocate more than one node. Specifying --ntasks instead of --nodes in the #SBATCH header ensures that your computing allocation is only charged for what you use, so this is preferable for more computationally intensive jobs:

#!/usr/bin/bash
#SBATCH --job-name=<name>
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --time=0:10:00
#SBATCH --mail-user=<email-address>

source /etc/bashrc
module purge
export JEDI_OPT=/data/prod/jedi/opt/modules
module use $JEDI_OPT/modulefiles/core
module load jedi/intel-impi
module list
ulimit -s unlimited

export SLURM_EXPORT_ENV=ALL
export HDF5_USE_FILE_LOCKING=FALSE

# make sure the number of tasks the application requires matches the SBATCH --ntasks specification above
cd <path-to-bundle-build-directory>/test/ufo
srun --ntasks=4 --cpu_bind=core --distribution=block:block test_ufo_radiosonde_opr testinput/radiosonde.yaml

exit 0

Then you can submit and monitor your jobs with these commands:

sbatch <batch-script>
squeue -u <your-user-name>

You can delete jobs with the scancel command. For further information please consult the S4 user documentation.