Worker Nodes Usage

This guide provides both an overview of and a practical introduction to the fundamental commands and usage patterns of the GLUON cluster’s job scheduler. In high-performance computing, efficient and effective job scheduling is crucial. GLUON uses HTCondor, a well-established open-source distributed job management system, to orchestrate computational tasks across its network.

HTCondor is specifically engineered to optimize the utilization of computing resources. It enables users to queue, manage, and monitor jobs within a distributed computing infrastructure. Due to its flexibility, power, and capability to handle a diverse range of tasks, HTCondor is an ideal choice for environments where resource management and task scheduling are of paramount importance.

Basic Commands

Below is a basic guide for users, covering essential commands and an example submission file named hello_world.sub.

  1. Submitting a Job: To submit a job to HTCondor, use the condor_submit command followed by your submission file name.

    [gluon_user@glui01 ~]$  condor_submit hello_world.sub
  2. Monitoring Your Job: The condor_q command displays the status of your submitted jobs.

    [gluon_user@glui01 ~]$  condor_q
  3. Removing a Job: To remove a job from the queue, use the condor_rm command with your job ID.

    [gluon_user@glui01 ~]$  condor_rm [Job ID]
  4. Pool Status: The condor_status command shows the status of the worker nodes (machine slots) in the HTCondor pool.

    [gluon_user@glui01 ~]$  condor_status
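
Putting these commands together, a typical interactive session might look like the sketch below (it assumes the hello_world.sub file described later in this guide; the job ID 1234 is hypothetical and should be replaced by the ID reported for your job):

    [gluon_user@glui01 ~]$  condor_submit hello_world.sub   # prints the cluster ID assigned to the job
    [gluon_user@glui01 ~]$  condor_q                         # check whether the job is idle or running
    [gluon_user@glui01 ~]$  condor_rm 1234                   # remove the job, using the ID reported by condor_q
    [gluon_user@glui01 ~]$  condor_status                    # inspect the state of the worker nodes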

HTCondor Vanilla Universe

The vanilla universe is the default execution environment in HTCondor and is suitable for most programs, including shell scripts.

To execute a specific command or program through HTCondor using the vanilla universe, the most convenient approach is to wrap it in a bash executable; this wrapper is what gets sent to the job scheduler. For example, suppose we want to execute the Python3 code hello_world.py through HTCondor. First, we create the bash script (hello_world.sh).

#!/bin/bash

##Uncomment this part if you use a conda environment
##-------------------------------
#EXERCISE_ENVIRONMENT="environment_name"
#eval "$(conda shell.bash hook)"
#conda activate $EXERCISE_ENVIRONMENT
##--------------------------------


##script
##----------
##executable arg_1 arg_2

python3 hello_world.py

Once the bash execution script is created, we must give it execution permissions. To do this, simply write the following command in the CLI:

[gluon_user@glui01 ~]$  chmod +x hello_world.sh
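
If you want to double-check that the permissions were applied, listing the file should show the execute bits (x) set:

    [gluon_user@glui01 ~]$  ls -l hello_world.sh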

Now that the file has execution permissions, we can submit it to HTCondor. To do this, we need to create the submission file (hello_world.sub). Here is an example of it for the vanilla universe:

      #########################################
      #Do not modify this part
      universe = vanilla
      #########################################

      #name of the bash script
      executable              = hello_world.sh
      arguments               = $(Process)

      #Path to the log files (before your first run, create the condor_logs directory)
      log                     = condor_logs/logs.log
      output                  = condor_logs/outfile.$(Cluster).$(Process).out
      error                   = condor_logs/errors.$(Cluster).$(Process).err

      ##Uncomment this line if you use conda environments
      #getenv = True

      #number of CPUs requested
      request_cpus = 3

      # Memory requested
      request_memory = 2 GB

      # Time requested
      +JobFlavour = "microcentury"

      #For cases where the required files are not in the /home directory
      ##################################
      #should_transfer_files = yes
      #when_to_transfer_output = on_exit
      #transfer_input_files = file_1, file_2

      #Requirements
      #####################
      #To exclude specific machines
      #requirements = ( Machine != "glwn01.ific.uv.es" && Machine != "glwn02.ific.uv.es" )
      #To run in a specific machine
      #requirements = ( Machine == "glwn03.ific.uv.es")

      #send to the queue
      queue

Be very careful: it is very important to choose a JobFlavour; by default, GLUON assumes the job will be short and assigns a 20-minute limit. Job flavours set the maximum time a job is allowed to run. Please choose the one most appropriate for the job you want to launch: if GLUON has a high workload when the job is submitted, this parameter is taken into account when setting queue priorities, and longer jobs receive lower priority during peak computational load hours. The description of the different job flavours available in GLUON can be found in the section Basics.
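
For instance, if your job needs more than the default 20 minutes, you can request a longer flavour in the submission file. The flavour name below is only an assumption used for illustration; check the Basics section for the flavours actually available on GLUON:

      # Time requested (example flavour name; see the Basics section)
      +JobFlavour = "workday"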

Detailed Explanation of an HTCondor Vanilla Universe Submission File

  • universe = vanilla: Specifies the HTCondor “universe” for the job. The vanilla universe is a basic execution environment suitable for most jobs.

  • executable = hello_world.sh: This line defines the script or executable that HTCondor will run, which in this case is hello_world.sh.

  • arguments = $(Process): Sets the arguments to be passed to the executable. $(Process) is a built-in HTCondor variable indicating the process number in a batch of jobs (see the multi-job sketch after this explanation).

  • log = condor_logs/logs.log: Path to the file where HTCondor will write the job’s log.

  • output = condor_logs/outfile.$(Cluster).$(Process).out: Specifies the path and file name for standard output. $(Cluster) and $(Process) are variables representing the cluster ID and process number, respectively.

  • error = condor_logs/errors.$(Cluster).$(Process).err: Path and name for the file where standard error output will be written.

  • getenv = True: Uncommenting this line allows the HTCondor job to inherit environment variables from the submitting environment, useful for Conda environments or specific environment variables.

  • request_cpus = 3: Indicates the number of CPUs requested for the job.

  • request_memory = 2 GB: Indicates the memory requested for the job.

  • should_transfer_files = yes: Instructs whether files should be transferred to the execution node.

  • when_to_transfer_output = on_exit: Determines when to transfer output files, in this case, upon job completion.

  • transfer_input_files = file_1, file_2: Lists the files to be transferred to the execution node.

  • requirements = (Machine != "glwn01.ific.uv.es" && Machine != "glwn02.ific.uv.es"): Sets specific requirements for the execution machine, excluding certain machines.

  • requirements = (Machine == "glwn03.ific.uv.es"): Restricts job execution to a specific machine.

  • queue: This command places the job into the HTCondor queue for execution. Without this line, the job would not be submitted to the queue.

Each line in this submission file configures how HTCondor will handle and execute your job, from setting the execution environment to specifying system resources and requirements.
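
As an illustration of how $(Process) and queue interact, the following sketch (not part of the example above) would submit five copies of hello_world.sh in a single cluster; each copy receives its own process number, 0 through 4, as its argument and writes to its own output and error files:

      universe   = vanilla
      executable = hello_world.sh
      arguments  = $(Process)
      log        = condor_logs/logs.log
      output     = condor_logs/outfile.$(Cluster).$(Process).out
      error      = condor_logs/errors.$(Cluster).$(Process).err
      queue 5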

HTCondor Parallel Universe

The parallel universe allows parallel programs, such as MPI jobs, to be run within HTCondor, making it possible to use several of GLUON’s worker nodes for a single computation.

Imagine we want to launch our C++ program hello_world_mpi.cpp through HTCondor using several worker nodes. First, we need to compile the program on the User Interface, exactly the same way as if we were launching it directly via the CLI.

[gluon_user@glui01 ~]$  /usr/mpi/gcc/openmpi-4.1.5a1/bin/mpicxx -o hello_world_mpi hello_world_mpi.cpp
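
For reference, a minimal hello_world_mpi.cpp consistent with the output shown later in this section could look like the sketch below; the exact message wording is illustrative, not the verbatim program used on GLUON.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);               // start the MPI runtime

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank); // rank of this process
    MPI_Comm_size(MPI_COMM_WORLD, &size); // total number of processes

    if (rank == 0) {
        std::printf("Hello World from the main process (rank 0) of %d processes.\n", size);
    } else {
        std::printf("Hello World from process %d of %d.\n", rank, size);
    }

    MPI_Finalize();                       // shut down the MPI runtime
    return 0;
}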

Although GLUON allows for compilation on worker nodes, it’s usually more convenient to compile on the User Interface. Next, we will have to send a submission file (hello_world_mpi.sub) with the following structure to the scheduler:

#########################################
#Do not modify this part
universe = parallel
executable = /usr/share/doc/condor-10.9.0/examples/openmpiscript
#########################################

# mpi executable and arguments:
#arguments = executable arg1 arg2 arg3
arguments = hello_world_mpi

# Number of machines requested
machine_count = 2
# CPUs per machine
request_cpus = 45
# Memory requested
request_memory = 2 GB

#Path to the log files (before your first run, create the condor_logs directory)
log                     = condor_logs/logs.log
output                  = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).out
error                   = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).err

+ParallelShutdownPolicy = "WAIT_FOR_ALL"

#For cases where the required files are not in the /home directory
##################################
#should_transfer_files = yes
#when_to_transfer_output = on_exit
#transfer_input_files = file_1, file_2

#Requirements
#####################
#To exclude specific machines
#requirements = ( Machine != "glwn01.ific.uv.es" && Machine != "glwn02.ific.uv.es" )
#To run in a specific machine
#requirements = ( Machine == "glwn03.ific.uv.es")

#send to the queue
queue

In this way, in our example, the total number of CPUs used will be machine_count x request_cpus = 2 x 45 = 90 CPUs. The output obtained when launching our code through HTCondor will be:

Hello World from the main process (rank 0) of 90 processes.
Hello World from process 3 of 90.
Hello World from process 62 of 90.
Hello World from process 4 of 90.
...

Detailed Explanation of an HTCondor Parallel Universe Submission File

  • universe = parallel: Specifies the HTCondor universe as parallel. This universe is used for parallel jobs, typically involving MPI (Message Passing Interface).

  • executable = /usr/share/doc/condor-10.9.0/examples/openmpiscript: The path to the executable that HTCondor will run. In this case, it is the MPI wrapper script located at /usr/share/doc/condor-10.9.0/examples/openmpiscript. It is shipped with HTCondor and normally does not require modification. However, if the job you want to run needs changes to it, the script can be copied to your local directory and modified there.

  • arguments = hello_world_mpi: Defines the arguments passed to the MPI wrapper script: our program hello_world_mpi, followed by any arguments it needs.

  • machine_count = 2: Specifies the number of Worker Nodes requested for the parallel job.

  • request_cpus = 45: Indicates the number of CPUs per Worker Node requested for the job.

  • request_memory = 2 GB: Indicates the memory requested per Worker Node for the job.

  • log = condor_logs/logs.log: Path for the logs file.

  • output = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).out: Path and filename pattern for standard output. $(NODE) is a variable specific to parallel universe jobs.

  • error = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).err: Path and filename pattern for standard error output.

  • +ParallelShutdownPolicy = "WAIT_FOR_ALL": This line specifies the shutdown policy for parallel jobs. WAIT_FOR_ALL means the job will not complete until all parallel nodes have completed.

  • should_transfer_files = yes: Determines if files need to be transferred to the execution node.

  • when_to_transfer_output = on_exit: Specifies when to transfer output files, usually upon job completion.

  • transfer_input_files = file_1, file_2: Lists files to be transferred to the execution node.

  • requirements = (Machine != "glwn01.ific.uv.es" && Machine != "glwn02.ific.uv.es"): Sets specific requirements for the execution machine, excluding certain machines.

  • requirements = (Machine == "glwn03.ific.uv.es"): Restricts job execution to a specific machine.

  • queue: Places the job into the HTCondor queue for execution.
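
As a usage note, scaling a parallel job up or down only requires adjusting machine_count and request_cpus. For example, the following sketch (illustrative values, not a tested GLUON configuration) would request 4 worker nodes with 20 CPUs each, giving 4 x 20 = 80 MPI ranks:

    machine_count  = 4
    request_cpus   = 20
    request_memory = 4 GB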