Run jobs on a supercomputer

There are two ways to run a job on a supercomputer. One way is to enter commands in real time (interactive mode) as if you are sitting in front of your personal computer, except that it now has much greater computing power. This way is quite convenient for short jobs. But as soon as you are disconnected from the supercomputer or as your session is expired, your running job will be terminated. Interactive mode is not suitable for long jobs. The other way is to submit a batch job. Imagine you turn in a job application and then go about your normal life. You don’t need to sit in front of the supercomputer to keep the connection alive. The supercomputer will silently do the job that you submitted and will notify you when it finishes. This is the batch mode.

To run the interactive mode with SLURM scheduler system, you can use either srun or salloc. They function differently but are similar in many ways. This article, printed from the website of the Computer Science Department of Colorado State University, nicely explains their differences. On SGE scheduler system, you run the interactive mode by qrsh or qlogin.

Whichever way you use, there will be a limit on how much resource you can request. This is because a supercomputer is designed for multiple concurrent users. There may be a cap on the number of jobs one can submit per day or week. There may be a cap on time, memory, number of cores that a job can use. To find out what those limits are on SLURM, use the following command:

sacctmgr show qos format=Name,Priority,Flags,MaxWall,MaxJobsPU,MaxSubmitPU,MaxTres

To run a batch mode, you can use sbatch. The command in the bash shell is quite simple: sbatch batchscriptname.sh. In the following example, we write a batch script called “sample-batchscript.sh” to allocate resource to run a Python program called “sample-py.py”. The job name is a string carrying the values of three variables a,b,c. The returned value is the sum a+b+c. The batch script file sample-batchscript.sh is as follows:

#!/bin/bash
#
# Scheduler specific section
# --------------------------
#SBATCH --job-name="a=11,b=21,c=3" # a name for your job
#SBATCH --nodes=1                  # node count, default is 1
#SBATCH --ntasks=2                 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1          # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=1M           # memory per cpu-core
#SBATCH --time=00:00:05            # total run time limit (HH:MM:SS)
#SBATCH --output=%j,%x.out 	       # name of output log file, default is slurm-%j.out
#SBATCH --mail-type=BEGIN,END,FAIL # send email when job begins, ends, or fails
#SBATCH --mail-user=youremailaddress
#
# Job specific section
# -----------------------
ulimit -s unlimited
printf "Job started at %s on %s \n" "$(date '+%H:%M:%S')" "$(date '+%m/%d/%Y')" 
echo ----------------Submitted batch script-----------
scontrol write batch_script $SLURM_JOB_ID - | sed '18,21d' | head -n -2
echo ------------End of submitted batch script-----------
mpirun -n $SLURM_NTASKS python3 sample-py.py $SLURM_JOB_NAME #mpiexec can replace mpirun
echo ------------Job summary-----------
scontrol show job $SLURM_JOB_ID

The first line starting with #! is to tell the bash shell that this is an executable file. The lines starting with #SBATCH is for SLURM to allocate resource. All other lines starting with # are comments and are ignored by the bash shell. After submitting the batch script with sbatch, SLURM will make a copy of the script so that the version that SLURM will work on is not affected by any changes you will make in your script file. An analogy is that after you submit your job application, you cannot alter it. You can alter the version you have in hand, but you cannot alter the version you just sent away. The codes between echo…echo is to print the version of the batch script at the time it is submitted. The part sed ‘18,21d’ and head -n -2 are to skip printing lines 18-21 and the last two lines in the batch script file.

The Python program sample-py.py is as follows:

import sys

def maincode(params):
    ass_params = params.split(",")
    for i in ass_params: exec(i,globals())
    return a+b+c

if __name__ == "__main__":
  params = sys.argv[1]
  print('The sum is: ',maincode(params))

On the SGE scheduler system, you submit the batch job by qsub: qsub batchscriptname.sh. An example of the SGE batch script is as follows:

#!/bin/bash
#
# give the job a name (environment variable $JOB_NAME)
#$ -N a=6,b=7,c=4
# set the shell
#$ -S /bin/bash
# join the output and error into one file (y means yes)
#$ -j y
# name the output file
#$ -o $JOB_ID-$JOB_NAME.out
# work in the current directory
#$ -cwd
#$ -M youremailaddress
# Email when job begins (b), ends (e), or aborted (a)
#$ -m bea
# parallel environment with requested number of cores (environment variable $NSLOTS)
#$ -pe orte 2
# request running time
#$ -l h_rt=00:00:05
# request virtual memory per MPI process
#$ -l h_vmem=1G # the total memory for job is 2G

ulimit -s unlimited
printf "Job started at %s on %s \n" "$(date '+%H:%M:%S')" "$(date '+%m/%d/%Y')" 
echo ----------------Submitted batch script-----------
cat $JOB_SCRIPT | sed '24,27d' | head -n -1
echo ------------End of submitted batch script-----------
conda activate dedalus2
mpirun -n $NSLOTS python3 sample-py.py $JOB_NAME
qstat -j $JOB_ID

I find the online documentation of SGE to be quite confusing compared to that of SLURM. However, this page, printed from the website of the College of Engineering Information Technology at Boston University, explains well how to request memory for the job.

Reflection on my high school Math

Something recently caused me to reflect on my math journey in high school. Perhaps, the memories of old days came back when I glanced through the Mathematics magazine “Toán Học và Tuổi Trẻ” (Mathematics and the Youth). Each issue of this monthly magazine challenges readers with a set of problems, and also gives the solutions to problems given in the previous month. Each question is very tricky. I remember that in my 10th grade my math teacher required us to individually submit solutions to the problem set by the end of each month. Of course, that was not the only assignments we had. We were also supposed to write solution to every problem in a trigonometry book and turn in many weekly problems. To me and my classmates, that was an immense stress.

The purpose of such training is to prepare us for a number of upcoming math contests: the Ho Chi Minh City Math Olympiad, the Vietnam Math Olympiad, the 30/4 Math Olympiad of South Vietnam, and even the International Math Olympiad (IMO). The kind of problems in these contests is similar to the kind of problems found in the Putnam Competition. Each problem is neatly stated and doesn’t require much background to understand. However, it requires a great deal of creativity and brain power to solve. I will refer to this kind of problem as the “olympiad” problems, although they might not be found in an IMO exam. Prof. Arthur Engel wrote an interesting article on how olympiad problems are created.

For my personal training in high school, I often used the websites mathlinks.ro (now called artofproblemsolving.com) and diendantoanhoc.org. On those websites, people post challenging problems for which they might or might not have a solution. When it comes to Number Theory, it is actually not hard to make up an extremely difficult problem without knowing how to solve it. In fact, I made up a few problems in 2005 which were standing until 2013. I did not know how to solve them, nor did I expect anybody to be able to solve them. But I am happy to see that there are ingenious people who are able to solve them. Those problems are:

Problem 1: Show that for each positive integer n, there exists a multiple of 2^n+1 that has exactly n digit 1’s in its decimal form. [Solution here]

Problem 2: Denote by a(n) the number of digit 1’s in the decimal form of n. Show that there exists a positive integer n such that a(n^2+1)=7a(n). [Solution here]

Problem 3: Show that there exist infinitely many square numbers n that has an odd number of digits (in the decimal form) in which the digit 1 occurs only once and it occurs at the middle of n. [Solution here]

There are similarity and difference between the olympiad problems which I did back then and the research problems which I do now. I don’t attempt to give a comprehensive comparison here. Rather, I would like to mention just a few points.

An olympiad problem found on an actual exam always has a solution. If following the “right” track, a person can solve the problem in a reasonable amount of time (an hour or so). A research problem does not have a solution available somewhere. It may take weeks, months, or years to solve (or completely abandoned). An olympiad problem is usually cooked up, i.e. artificially created or complexified from a much easier problem. Its purpose is to test and strengthen the intellectual ability of students. It is usually a stand-alone problem and is not viewed as part of a bigger problem. Unless one can point out how such a problem fits in the math literature, it will be hard to be published on a peer-reviewed journal. On the other hand, a research problem typically stems from an existing line of research, tackling a well-known problem. Knowledge on a research problem is usually incremental. The solution might not be found all at once, but is known little by little over time. Any discovery enriches the literature on that line of research, and is suitable to be published on a peer-reviewed journal. Unlike olympiad problems which are typically intended to be solved individually, research problems often require human collaboration to cope with.

I am grateful for the training that I underwent in high school despite the excessive stress it gave me then. Now as an educator and researcher, I have numerous opportunities to use the skills acquired then to assist my students and research groups.

Characterization of degenerate scaling

This post is a follow-up to my post on 12/29/2023, where I posed the following question:

Let (X_n) be an i.i.d. sequence of \mathbb{R}-valued random variables. Let (a_n) be a positive sequence such that \lim a_n=\infty. Under what condition of (a_n) can one conclude that \lim\frac{X_n}{a_n}=0 almost surely?

I have found a full characterization of the sequence (a_n). After all, it is an interesting exercise.

Proposition: Let (X_n) be an i.i.d. sequence of \mathbb{R}-valued random variables Let (a_n) be a positive sequence such that \lim a_n=\infty. If \sum P\left(|X_1|>\frac{a_n}{c}\right)<\infty for any constant c>0, then \lim\frac{X_n}{a_n}=0 almost surely. Otherwise, almost surely \frac{X_n}{a_n} does not converge.

As an application, if (X_n) is an i.i.d. sequence of mean-one exponentially distributed random variables, then almost surely the sequence of \frac{X_n}{\ln n} does not converge.

Proof: First, suppose that there exists c>0 such that \sum P\left(|X_1|>\frac{a_n}{c}\right)=\infty. Then the independent events E_n=[|X_n|>\frac{a_n}{c}] satisfies

\displaystyle \sum \mathbb{P}(E_n)=\infty.

By Borel-Cantelli’s Lemma, almost surely the events E_n occurs for infinitely many n. Therefore, with probability 1, the sequence (X_n/a_n) does not converge to 0. Now fix b\neq 0. We show that almost surely X_n/a_n\not\to b. Consider the independent events E'_n=\left[|X_n|<\frac{|b|a_n}{2}\right]. One has

\displaystyle \sum\mathbb{P}(E'_n)=\infty

By Borel-Cantelli’s Lemma, almost surely the events E'_n occurs for infinitely many n. Therefore, with probability 1, the sequence (X_n/a_n) does not converge to b.

Next, suppose that for every c>0, one has \sum P\left(|X_1|>\frac{a_n}{c}\right)<\infty. For a fixed constant c>0, the event A_c=\left[\limsup\frac{|X_n|}{a_n}\le \frac{1}{c}\right] is a tail event. Thus, \mathbb{P}(A_c)=0 or \mathbb{P}(A_c)=1 according to Kolmogorov’s 0-1 Law. Note that \cap_{n=1}^\infty \left[|X_n|\le \frac{a_n}{c}\right]\subset A_c and

\displaystyle \mathbb{P}\left(\bigcap\limits_{n=1}^\infty \left[|X_n|\le \frac{a_n}{c}\right]\right)=\prod_{n=1}^\infty \left(1-P\left(|X_1|>\frac{a_n}{c}\right)\right)

which is positive because \sum P\left(|X_1|>\frac{a_n}{c}\right)<\infty. Thus, \mathbb{P}(A_c)>0 and therefore, \mathbb{P}(A_c)=1. Now note that

\displaystyle \mathbb{P}\left(\lim\frac{X_n}{a_n}=0\right)= \mathbb{P}\left(\limsup\frac{|X_n|}{a_n}=0\right)=\lim_{k\to\infty} \mathbb{P}(A_{k})=1.

This completes the proof.

Degenerate scaling of random variables

There are well-known results about the scaling and centering laws of a sequence of \mathbb{R}-valued random variables. That is the problem of finding the right coefficients a_n and b_n such that the sequence \frac{X_n-b_n}{a_n} converges in law. See, for example, Section 14.8 of [1]. Normally, one tries to avoid the degenerate type, which is when the limit distribution is a Dirac mass at 0. I recently ran into a problem where the degenerate type is desirable. More specifically:

Let (X_n) be an i.i.d. sequence of \mathbb{R}-valued random variables. Let (a_n) be a positive sequence such that \lim a_n=\infty. Under what condition of (a_n) can one conclude that \lim\frac{X_n}{a_n}=0 almost surely?

This problem seems to be quite classic. If X_1 has a bounded range, i.e. \mathbb{P}(X_1\in[a,b])=1 for some numbers a and b, then no additional condition of (a_n) is needed. How about the case X_1 has an unbounded range? This is essentially Problem 51 on page 263 of [1]. The textbook does not ask for a full characterization of the sequence a_n. I find the problem quite interesting.

The first observation is that the event A=\left[\lim\frac{X_n}{a_n}=0\right] is a tail event. By Kolmogorov’s 0-1 Law, either \mathbb{P}(A)=0 or \mathbb{P}(A)=1. We just need to decide between the two. One may be tempted to think that \frac{X_n}{a_n} is essentially the same as \frac{X_1}{a_n}, so it should almost surely converge 0 as n\to\infty. This argument is not correct because \frac{X_n}{a_n} only has the same distribution as \frac{X_1}{a_n}. If the range of values of X_1 is unbounded, then so is the range of values of \frac{X_1}{a_n}. Let us consider two following cases, one of which gives \mathbb{P}(A)=0 and the other gives \mathbb{P}(A)=1. Let us denote the cumulative distribution function of |X_1| by F(x).

Case 1: there exists a positive sequence (c_n) such that

\displaystyle \lim\frac{c_n}{a_n}=0 and \displaystyle\sum(1-F(c_n))<\infty.

Let E_n=[|X_n|\le c_n] and E=\cap_{n=1}^\infty E_n. Then \mathbb{P}(E_n)=F(c_n) and

\displaystyle \mathbb{P}(E)=\prod_{n=1}^\infty F(c_n)=\prod_{n=1}^\infty (1-(1-F(c_n))

which has a finite positive value because \sum(1-F(c_n))<\infty. It is easy to see that E\subset A and thus \mathbb{P}(A)=1.

Case 2: there exists a positive sequence (c_n) such that

\displaystyle \lim\frac{c_n}{a_n}>0 and \displaystyle\sum(1-F(c_n))=\infty.

Let E_n=[|X_n|> c_n] and E=\limsup E_n. The events E_n are independent and

\displaystyle \sum \mathbb{P}(E_n)=\sum(1-F(c_n))=\infty.

By Borel-Cantelli’s Lemma, \mathbb{P}(E)=1. It is easy to see that E\subset A^c. Thus, \mathbb{P}(A)=1-\mathbb{P}(A^c)=0.

Final thoughts: Cases 1 and 2 do not cover all possibilities. One can see this through the simple case F(x)=1-e^{-x} (the case when the random variables X_n are i.i.d. exponentials of mean one) and a_n=2\ln n. It would be interesting to see a full characterization of the sequence (a_n) for which one has \mathbb{P}(A)=1. I think this should be known somewhere in the literature.

Update 01/01/2024: I have resolved the problem.

References:

  1. A modern approach to Probability Theory” (1997) by Friestedt and Gray.

Time out

Today is Christmas. I normally work every day of the week, only except for the Sabbath Day (Sunday). Although today is Monday, I decided to abstain from my professional work and just spend time with my family to celebrate this holy-day.

How important it is for us, human beings, to have timeout from time to time. Timeout helps us renew our energy, reconsider our priorities, adjust our course of action, reflect on our path, repent and renew the covenant with God. As a Latter-Day Saint, I honor the Sabbath Day and do my best to keep it holy. Each Sunday becomes a checkpoint for me – a time to record my spiritual and temporal welfare. My professional life is very demanding of my time. With the ability to work remotely, I feel constantly plugged-in. Working in such a manner could easily become an obsession. Obsession, if going untreated, could become an addiction – a mental bond that is extremely hard to break. God, who created all things including our physically bodies and spirits, knows exactly how our bodies work. He sets an example of resting on the seventh day. We likewise need to rest periodically from our labor. In fact, it is one of the Ten Commandments: “Remember the sabbath day, to keep it holy” (Exodus 20:8). Dr. John Tanner at BYU wrote a very interesting article about Labor and Rest. I invite you to check it out.

Sometimes, for various reasons, we may need to rest for longer than a day. I have many hobbies (including reading online books from libraries) each of which is not evil in its own right but is chipping a bit from the precious time I should have spent with my family. For a misbehaving child, a timeout is not only a punishment but also a help for him/her to discontinue their bad behavior. Several days of abstaining from those hobbies have been crucial to help me change the way I use my time.

In this Christmas season, I am so grateful for the Savior, His humble birth, His exemplary life, and His all-encompassing atonement which makes second chances possible to all mankind. No matter how busy and hectic our lives are, we can always find peace in Him.