.. _condor_gluon:

Batch System
====================

This guide provides an overview and a practical introduction to the fundamental commands and usage patterns of the GLUON cluster's job scheduler. In high-performance computing, efficient and effective job scheduling is crucial. GLUON uses HTCondor, a well-known open-source distributed job management system, to orchestrate computational tasks across its network.

HTCondor is engineered to optimize the utilization of computing resources. It enables users to queue, manage, and monitor jobs within a distributed computing infrastructure. Its flexibility, power, and ability to handle a diverse range of tasks make it an ideal choice for environments where resource management and task scheduling are of paramount importance.

Basic Commands
---------------

Below is a basic guide for users, covering essential commands and an example submission file named `hello_world.sub`.

1. **Submitting a Job**

   To submit a job to HTCondor, use the `condor_submit` command followed by your submission file name.

   .. code-block:: console

      [gluon_user@glui01 ~]$ condor_submit hello_world.sub
2. **Monitoring Your Job**

   The `condor_q` command displays the status of your submitted jobs.

   .. code-block:: console

      [gluon_user@glui01 ~]$ condor_q
3. **Removing a Job**

   To remove a job from the queue, use the `condor_rm` command with your job ID.

   .. code-block:: console

      [gluon_user@glui01 ~]$ condor_rm [Job ID]
4. **Pool Status**

   You can check the status of the machines (slots) in the pool using `condor_status`.

   .. code-block:: console

      [gluon_user@glui01 ~]$ condor_status
HTCondor Vanilla Universe
------------------------------------------------------------

The vanilla universe in HTCondor is intended for most programs; shell scripts are another case where it is useful. To execute a specific command or program through HTCondor using the vanilla universe, the most convenient approach is to create a bash executable: what is sent to the job scheduler is this wrapper script.

Let's imagine we want to execute the Python3 script hello_world.py through HTCondor. First, we create the bash script (hello_world.sh):

.. code-block:: bash

   #!/bin/bash

   ## Uncomment this part if you use a conda environment
   ## --------------------------------
   #EXERCISE_ENVIRONMENT="environment_name"
   #eval "$(conda shell.bash hook)"
   #conda activate $EXERCISE_ENVIRONMENT
   ## --------------------------------

   ## script
   ## ----------
   ## executable arg_1 arg_2
   python3 hello_world.py

Once the bash execution script is created, we must give it execution permissions. To do this, simply run the following command in the CLI:

.. code-block:: console

   [gluon_user@glui01 ~]$ chmod +x hello_world.sh
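The wrapper above assumes a `hello_world.py` script exists alongside it. As a minimal placeholder (these contents are hypothetical; any Python 3 program would do here), it could be as simple as:

```python
# hello_world.py -- minimal placeholder script run by hello_world.sh
# (hypothetical contents, shown only to make the example self-contained)

def main() -> None:
    # Print a greeting; the job's stdout ends up in the .out file
    # configured in the submission file.
    print("Hello, world from GLUON!")

if __name__ == "__main__":
    main()
```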
Now that the file has execution permissions, we can submit it to HTCondor. To do this, we need to create the submission file (hello_world.sub). Here is an example for the vanilla universe:

.. code-block:: bash

   #########################################
   # Do not modify this part
   universe = vanilla
   #########################################

   # Name of the bash script
   executable = hello_world.sh
   arguments = $(Process)

   # Path to the log files (on your first run, create the directory condor_logs)
   log = condor_logs/logs.log
   output = condor_logs/outfile.$(Cluster).$(Process).out
   error = condor_logs/errors.$(Cluster).$(Process).err

   ## Uncomment this line if you use conda environments
   #getenv = True

   # Number of CPUs requested
   request_cpus = 3

   # Memory requested
   request_memory = 2 GB

   # Time requested
   +JobFlavour = "microcentury"

   # For cases where the required files are not in the /home directory
   ##################################
   #should_transfer_files = yes
   #when_to_transfer_output = on_exit
   #transfer_input_files = file_1, file_2

   # Send to the queue
   queue

Be careful: it is very important to choose a job flavour. By default, GLUON assumes the job will be short and assigns 20 minutes. Job flavours establish the time limit for which the job is allowed to run. Please choose the most appropriate one for the specific job you want to launch: if GLUON has a high workload when the job is submitted, this parameter is taken into account when setting priorities in the queue, and longer jobs will have lower priority during peak computational load hours. The description of the different job flavours available in GLUON can be found in the section :ref:`basics`.

Detailed Explanation of an HTCondor Vanilla Universe Submission File
..............................................................................

- ``universe = vanilla``: Specifies the HTCondor "universe" for the job. The `vanilla` universe is a basic execution environment suitable for most jobs.
- ``executable = hello_world.sh``: Defines the script or executable that HTCondor will run, in this case `hello_world.sh`.
- ``arguments = $(Process)``: Sets the arguments to be passed to the executable. `$(Process)` is a built-in HTCondor variable indicating the process number within a batch of jobs.
- ``log = condor_logs/logs.log``: Path to the file where HTCondor will write the job's log.
- ``output = condor_logs/outfile.$(Cluster).$(Process).out``: Specifies the path and file name for standard output. `$(Cluster)` and `$(Process)` are variables representing the cluster ID and process number, respectively.
- ``error = condor_logs/errors.$(Cluster).$(Process).err``: Path and name for the file where standard error output will be written.
- ``getenv = True``: Uncommenting this line allows the HTCondor job to inherit environment variables from the submitting environment, which is useful for conda environments or specific environment variables.
- ``request_cpus = 3``: Indicates the number of CPUs requested for the job.
- ``request_memory = 2 GB``: Indicates the memory requested for the job.
- ``should_transfer_files = yes``: Instructs HTCondor to transfer files to the execution node.
- ``when_to_transfer_output = on_exit``: Determines when to transfer output files, in this case upon job completion.
- ``transfer_input_files = file_1, file_2``: Lists the files to be transferred to the execution node.
- ``queue``: Places the job into the HTCondor queue for execution. Without this line, the job would not be submitted.

Each line in this submission file configures how HTCondor will handle and execute your job, from setting the execution environment to specifying system resources and requirements.

HTCondor Parallel Universe
----------------------------

The parallel universe allows parallel programs, such as MPI jobs, to be run within HTCondor. This makes it possible to use several of GLUON's worker nodes for parallel computing.
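Since ``arguments = $(Process)`` forwards the process number to the wrapper, a common pattern is to let the Python script use that number to pick its share of the work. A minimal sketch (the task list and function name here are illustrative, not part of HTCondor):

```python
import sys

# Illustrative list of inputs; in practice these might be data files.
TASKS = ["sample_a", "sample_b", "sample_c", "sample_d"]

def pick_task(process_id: int, tasks: list) -> str:
    """Map an HTCondor $(Process) number (0, 1, 2, ...) to one task."""
    return tasks[process_id % len(tasks)]

if __name__ == "__main__":
    # hello_world.sh would forward $(Process) as the first argument,
    # e.g.:  python3 hello_world.py "$1"
    proc = int(sys.argv[1]) if len(sys.argv) > 1 else 0
    print(f"Process {proc} handles {pick_task(proc, TASKS)}")
```

Submitting with ``queue 4`` instead of ``queue`` would then run four processes, numbered 0 to 3, each handling one entry of the list.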
Imagine we want to launch our C++ program hello_world_mpi.cpp through HTCondor using several worker nodes. First, we need to compile the program on the User Interface, exactly as if we were launching it directly via the CLI:

.. code-block:: console

   [gluon_user@glui01 ~]$ /usr/mpi/gcc/openmpi-4.1.5a1/bin/mpicxx -o hello_world_mpi hello_world_mpi.cpp
Although GLUON allows for compilation on worker nodes, it is usually more convenient to compile on the User Interface. Next, we have to send a submission file (hello_world_mpi.sub) with the following structure to the scheduler:

.. code-block:: bash

   #########################################
   # Do not modify this part
   universe = parallel
   executable = /usr/share/doc/condor-23.6.2/examples/openmpiscript
   #########################################

   # MPI executable and arguments:
   #arguments = executable arg1 arg2 arg3
   arguments = hello_world_mpi

   # Number of machines requested
   machine_count = 2

   # CPUs per machine
   request_cpus = 45

   # Memory requested
   request_memory = 2 GB

   # Time requested
   +JobFlavour = "microcentury"

   # Path to the log files (on your first run, create the directory condor_logs)
   log = condor_logs/logs.log
   output = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).out
   error = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).err

   +ParallelShutdownPolicy = "WAIT_FOR_ALL"

   # For cases where the required files are not in the /home directory
   ##################################
   #should_transfer_files = yes
   #when_to_transfer_output = on_exit
   #transfer_input_files = file_1, file_2

   # Send to the queue
   queue

In this way, in our example, the total number of CPUs used will be **request_cpus** x **machine_count** = 90 **CPUs**. The output obtained when launching our code through HTCondor will be:

.. code-block:: bash

   Hello World from the main process (rank 0) of 90 processes.
   Hello World from process 3 of 90.
   Hello World from process 62 of 90.
   Hello World from process 4 of 90.
   ...

Detailed Explanation of an HTCondor Parallel Universe Submission File
.......................................................................

- ``universe = parallel``: Specifies the HTCondor universe as `parallel`. This universe is used for parallel jobs, typically involving MPI (Message Passing Interface).
- ``executable = /usr/share/doc/condor-23.6.2/examples/openmpiscript``: The path to the script that HTCondor will run, in this case an MPI wrapper script shipped with HTCondor. It normally does not require modification; however, if the program you are trying to execute requires changes, it can be copied to a local directory and modified there.
- ``arguments = hello_world_mpi``: Defines the arguments passed to the MPI script: our program hello_world_mpi, followed by all the arguments needed to execute it.
- ``machine_count = 2``: Specifies the number of worker nodes requested for the parallel job.
- ``request_cpus = 45``: Indicates the number of CPUs per worker node requested for the job.
- ``request_memory = 2 GB``: Indicates the memory per worker node requested for the job.
- ``log = condor_logs/logs.log``: Path for the log file.
- ``output = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).out``: Path and filename pattern for standard output. `$(NODE)` is a variable specific to parallel universe jobs.
- ``error = condor_logs/$(Cluster).$(machine_count).$(request_cpus).$(NODE).err``: Path and filename pattern for standard error output.
- ``+ParallelShutdownPolicy = "WAIT_FOR_ALL"``: Specifies the shutdown policy for parallel jobs. `WAIT_FOR_ALL` means the job does not complete until all parallel nodes have finished.
- ``should_transfer_files = yes``: Determines whether files need to be transferred to the execution node.
- ``when_to_transfer_output = on_exit``: Specifies when to transfer output files, usually upon job completion.
- ``transfer_input_files = file_1, file_2``: Lists files to be transferred to the execution node.
- ``queue``: Places the job into the HTCondor queue for execution.
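Because the `$(Cluster)`/`$(NODE)` naming above gives every node its own `.err` file, it is easy to sweep `condor_logs/` for nodes that wrote to standard error after a run. A small helper one might write for this (hypothetical utility, not part of HTCondor):

```python
from pathlib import Path

def failed_nodes(log_dir: str) -> list:
    """Return the .err files under log_dir that are non-empty,
    i.e. the nodes that wrote something to standard error."""
    base = Path(log_dir)
    if not base.is_dir():
        return []
    return sorted(str(p) for p in base.glob("*.err") if p.stat().st_size > 0)

if __name__ == "__main__":
    # Assumes the condor_logs directory from the submission file above.
    for err_file in failed_nodes("condor_logs"):
        print(f"check {err_file}")
```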