4. Running the code

4.1. Binaries and execution

Compilation should produce three executables: rh15d_ray_pool, rh15d_ray, and rh15d_lteray. The last is a special case for running only in LTE. The other two are the main programs, and they represent two different modes of job distribution: normal and pool.

In the pool mode there is one process that works as overlord: its function is to distribute the work to the other processes. The other processes (drones) ask the overlord for a work unit. When they finish their unit, they go back and ask for more, until all tasks are completed. Because of race conditions and because different columns run at different speeds, it is not possible to know beforehand which columns a given process will run. Because one process is dedicated to the overlord, rh15d_ray_pool needs to run with two or more processes. The advantage of the pool mode is that the dynamic load allocation ensures the most efficient use of the resources. With the normal mode it may happen that some processes work on columns that take longer to converge (especially since each process's columns are adjacent), and in the end the execution has to wait for the slowest process. In some cases (especially with PRD) the pool mode can be 2-3 times faster than the normal mode. When one runs with a large number of processes (> 2000) and each column takes little time to calculate, the pool mode can suffer from communication bottlenecks and may be slower, because a single overlord cannot distribute the tasks fast enough. The only disadvantage of the pool mode (so far) is that not all output is currently supported in this mode.
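
To illustrate the overlord/drone pattern, here is a minimal sketch in Python using mpi4py. This is not RH 1.5D's actual implementation (which is written in C); the number of tasks and the run_column placeholder are hypothetical.

from mpi4py import MPI

comm = MPI.COMM_WORLD
n_tasks = 100                        # e.g. number of atmosphere columns

if comm.Get_rank() == 0:
    # Overlord: hand out one task at a time until none are left
    status = MPI.Status()
    next_task, done = 0, 0
    while done < comm.Get_size() - 1:
        comm.recv(source=MPI.ANY_SOURCE, tag=1, status=status)   # drone asks for work
        if next_task < n_tasks:
            comm.send(next_task, dest=status.Get_source(), tag=2)
            next_task += 1
        else:
            comm.send(None, dest=status.Get_source(), tag=2)      # nothing left, drone stops
            done += 1
else:
    # Drone: request a work unit, compute it, ask for more
    while True:
        comm.send(None, dest=0, tag=1)
        task = comm.recv(source=0, tag=2)
        if task is None:
            break
        # run_column(task) would solve the non-LTE problem for one column here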

Warning

The normal mode is deprecated and will be removed in a later revision. Use only for single processor runs or if you know what you’re doing!

In the normal mode the jobs (all the atmosphere columns one wants to calculate) are divided by the number of processes at the start of the execution. There is no communication between processes, and each process knows from the start all the columns it is going to run. These columns are adjacent. If the number of columns is not a multiple of the number of processes, some processes will have larger workloads. There is no minimum number of processes, and rh15d_ray can also be run as a single process. Some regions of an atmosphere can take much longer to run than others, and the processes working on those regions will take longer to finish. In the normal mode this means that the slowest process sets the overall running time, so in practice it can take more than 10x longer than the pool mode (and is therefore not recommended).
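
For comparison, the static division used by the normal mode can be sketched as follows. This is a simplified Python illustration, not the actual C code, and exactly which processes receive the extra columns is an assumption:

def split_columns(n_columns, n_procs, rank):
    """Contiguous range of column indices assigned to process `rank`."""
    base, remainder = divmod(n_columns, n_procs)
    size = base + (1 if rank < remainder else 0)    # some processes get one extra column
    start = rank * base + min(rank, remainder)
    return range(start, start + size)

# e.g. 262144 columns over 1000 processes gives blocks of 263 or 262 adjacent columns
print([list(split_columns(10, 3, r)) for r in range(3)])
# -> [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]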

As an MPI program, the binaries should be launched with the appropriate command. Some examples:

mpirun -np N ./rh15d_ray_pool
mpiexec ./rh15d_ray_pool    # use in Pleiades
aprun -B ./rh15d_ray        # use in Hexagon or other Cray

4.2. The run directory

Warning

Before running, make sure you have the sub-directories scratch and output in the run directory.

The run directory contains the configuration files, the binaries, and the scratch and output directories. As the names imply, temporary files will be placed under scratch and the final output files in output. No files under scratch will be used after the run is finished (they are not read for re-runs).

The scratch directory contains different types of files. Most of them are binary files written by RH 1.5D to save memory. Example files are the background_p*.dat files with the background opacities, files with PRD weights, and the rh_p*.log log files. Each process creates one of these files, with the suffix _pN.*, where N is the process number. The log files have the same format as in RH. The writing of each process's log file is buffered by line. Because these files are updated often, they can slow down some systems when running with many processes. Therefore, it is possible to use full buffering instead (meaning the log files are only written when the main program finishes). This option is not exposed in the configuration files, so one needs to change the file parallel.c in the following part:

/* _IOFBF for full buffering, _IOLBF for line buffering */
setvbuf(mpi.logfile, NULL, _IOLBF, BUFSIZ_MPILOG);

One should replace _IOLBF by _IOFBF to change from line buffering to full buffering.

The output directory will contain the three output files: output_aux.hdf5, output_indata.hdf5, and output_ray.hdf5. See Output file structure for more details on the structure of these files. If doing a re-run, these files must already exist; they will be updated with the new results. Otherwise, if these files are already in output before the execution, they will be overwritten. At the start of the execution, the output files are written with a special fill value. This means that the disk space for the full output must be available at the start of the run, so no CPU time is wasted only to find at the end of the run that there is not enough disk space for the output. The files are updated every time a process finishes work on a given column, and the fill values are overwritten with the actual data. One advantage of this method is that even if the system crashes or the program stops, it is possible to recover the results already written (and a re-run can be performed for just the missing columns).
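
Because unwritten columns keep the fill value, a partially written output file can be inspected to find which columns are missing and would need a re-run. The following is a minimal sketch using h5py; the dataset name intensity, its (nx, ny, nwave) ordering, and the NaN fill value are assumptions that may differ in your output files:

import numpy as np
import h5py

with h5py.File('output/output_ray.hdf5', 'r') as f:
    data = f['intensity'][...]                    # assumed dataset name and shape (nx, ny, nwave)
    missing = np.all(np.isnan(data), axis=-1)     # columns still entirely at the fill value
    print('Columns not written:', np.argwhere(missing))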

All the processes write asynchronously to all the output files. In some cases this can cause contention in the filesystem, with many processes trying to access the same data at the same time. In the worst-case scenario, the contention can create bottlenecks that practically stop the execution. Therefore, it is highly recommended that users tune their filesystem for the typical loads of RH. Many supercomputers make use of Lustre, a parallel filesystem. With Lustre, resources such as files can be divided into different stripes that can be placed on several different machines (OSTs). For running RH with more than 500 processes, one should use as many OSTs as available in the system, and set the Lustre stripe size to the typical amount of data written to a file per simulation column. The striping can be set with the lfs setstripe command:

lfs setstripe -s stripe_size -c stripe_count -o stripe_offset directory|filename

The command can be run per file (e.g. output_ray.hdf5), or for the whole output directory. Using a stripe count of -1 ensures that the maximum number of OSTs is used. For the typical files RH 1.5D produces, it is usually fine to apply the same Lustre settings to the whole output directory, and the following settings seem to be reasonable:

lfs setstripe -s 4m -c -1 output/

Similarly, the scratch directory can also benefit from Lustre striping. Because most files there are small, it is recommended to use a stripe count of 1 for scratch.
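
Assuming the same lfs setstripe syntax shown above, this could be done with, for example:

lfs setstripe -c 1 scratch/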

Warning

You need a parallel filesystem if you are running RH 1.5D across more than one host. Most supercomputers have parallel filesystems, but if you are running in a smaller cluster this may not be the case. RH 1.5D will always run, but the HDF5 writes will not work and the results will be unreadable. NFS is not a parallel file system.

4.3. Logs and messages

In addition to the per-process logs saved to scratch, a much smaller log is printed to stdout. This log is a brief summary of what each process is doing. Here is an example of typical messages:

Process   1: --- START task   1, (xi,yi) = (  0,156)
Process 232: --- START task   1, (xi,yi) = (  0,159)
Process  36: --- START task   1, (xi,yi) = (  0,162)
Process  12: --- START task   1, (xi,yi) = (  0,171)
(...)
Process  12: *** END   task   1 iter, iterations = 121, CONVERGED
Process   3: *** END   task   1 iter, iterations = 200, NO Convergence
Process   4: *** SKIP  task   1 (crashed after 81 iterations)
Process   3: --- START task   2, (xi,yi) = ( 23, 64)
Process  12: --- START task   2, (xi,yi) = ( 23, 65)
Process   4: --- START task   2, (xi,yi) = ( 23, 65)
(...)
*** Job ending. Total 262144 1-D columns: 262142 converged, 1 did not converge, 1 crashed.
*** RH finished gracefully.

In this example one can see the three possible outcomes for a single-column calculation: convergence, non-convergence (meaning the target ITER_LIMIT was not met in N_MAX_ITER iterations), or a crash (which can have many causes). If there are singular matrices or other causes for a column to crash, RH 1.5D will skip that column and proceed to the next work unit. Such cases can be re-run with different parameters. In some cases (e.g. non-existent files) it is not possible to prevent a crash, and RH 1.5D will finish non-gracefully.

4.4. Reruns and lack of convergence

Dynamic atmospheres often have large gradients in temperature, density, velocity, etc. that make the non-LTE problem difficult to solve. This may lead to some (or all) columns not converging, depending on the input atmosphere, model atoms, and run options. In RH terms, “non-converged” and “crashed” columns represent the same underlying problem: in some cases the iterations diverge strongly (crash), while in others they fail to reach the target limit for convergence in the allocated maximum number of iterations (non-convergence).

The output for atmosphere columns that did not converge or crashed is not saved to disk (a fill value is used instead). The recommended procedure in these cases is to rerun RH with different input options (e.g. less aggressive acceleration, a larger number of maximum iterations). A special rerun mode is available to avoid recalculating the columns that have already converged. When 15D_RERUN = TRUE in keyword.input, RH will read the output and run again only the columns that did not converge.

The rerun mode requires all the previous output to be under output/. The user can change options in keyword.input for the rerun, but not all options can be changed, because this could lead to nonsensical results (e.g. changing the input atmosphere or atom files). Only the following options can be changed in keyword.input (an example rerun configuration is shown after the list):

15D_RERUN
15D_DEPTH_CUT
15D_TMAX_CUT
15D_DEPTH_REFINE
N_PESC_ITER
COLLRAD_SWITCH
COLLRAD_SWITCH_INIT
PRD_SWITCH
NRAYS
N_MAX_SCATTER
N_MAX_ITER
ITER_LIMIT
NG_DELAY
NG_ORDER
NG_PERIOD
S_INTERPOLATION
PRD_N_MAX_ITER
PRD_ITER_LIMIT
B_STRENGTH_CHAR
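
For example, a rerun that relaxes the convergence requirements might change only a few of these keywords in keyword.input; the values below are only illustrative and should be adapted to the problem at hand:

15D_RERUN = TRUE
N_MAX_ITER = 300
NG_DELAY = 60
NG_PERIOD = 40
PRD_SWITCH = 0.002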

All the other options are locked to the values used for the first run. Likewise, it is not possible to change atoms.input, the atom files, or the line list files. When RH 1.5D is first run, nearly all input options (except the input atmosphere and molecule files) are saved into the output, and in a rerun these are read from the output and not from the original files. This also means that a rerun can be performed even if the original files are no longer available.

4.5. Helper script

There is a Python script called runtools.py designed to make it easier to run RH 1.5D for large projects. It resides in rh/python/runtools.py. It requires Python with the numpy and h5py (or netCDF4) modules. It was made to run a given RH 1.5D setup over many simulation snapshots, spanning several atmosphere files. It supports a progressive rerun of a given problem, and allows the use of different keyword.input parameters for different columns, to tackle columns that are harder to converge.

The first part of runtools.py should be modified for a user's needs. It typically contains:

atmos_dir = '/mydata_dir'
seq_file = 'RH_SEQUENCE'
outsuff = 'output/output_ray_mysim_CaII_PRD_s%03i.hdf5'
mpicmd = 'mpiexec'
bin = './rh15d_ray_pool'
defkey = 'keyword.save'
log = 'rh_running.log'
tm = 40
rerun = True
rerun_opt = [ {'NG_DELAY': 60, 'NG_PERIOD': 40, '15D_DEPTH_REFINE': 'FALSE',
               '15D_ZCUT': 'TRUE', 'N_MAX_ITER': 250, 'PRD_SWITCH': 0.002 },
              {'NG_DELAY': 120, 'NG_PERIOD': 100, '15D_DEPTH_REFINE': 'TRUE',
               'PRD_SWITCH': 0.001 } ]

The different options are:

Name        Type    Description
atmos_dir   string  Directory where the atmosphere files are kept.
seq_file    string  Location of the sequence file. This file contains the names of the atmosphere files to be used (one file per line). The script will then run RH 1.5D for every snapshot in every file listed.
outsuff     string  Template for the names of the output_ray files. The %03i format will be replaced by the snapshot number.
mpicmd      string  System-dependent command to launch MPI. The script knows that for aprun the -B option should be used. This option also activates system-specific routines (e.g. how to kill the run in Pleiades).
bin         string  RH 1.5D binary to use.
defkey      string  Default template for keyword.input. Because of the rerun options, keyword.input is overwritten for every rerun. This file is used as a template (i.e., most of its options will be unchanged unless specified in rerun_opt).
log         string  File where the main log is saved. It will be overwritten for each new snapshot.
tm          int     Timeout (in minutes) after which the execution is killed if no message has been written to the main log. Used to prevent the code from hanging if there are system issues. After being killed, the program is relaunched. If tm = 0, the program will never be killed.
rerun       bool    If True, the program will be re-run (with different settings) to achieve convergence if any columns failed. The number of reruns is given by the length of rerun_opt.
rerun_opt   list    Options for the reruns. This is a list of dictionaries; each dictionary contains the keywords used to update keyword.input for one rerun. Only the keywords that differ from the defkey file are necessary.

Note

Only the first line of the sequence file is read at a time. The script reads the first line, deletes it from the file, and closes the file. It then reads the (new) first line and continues running, until there are no more lines in the file. This behaviour enables the file to be worked on by multiple scripts at the same time, and allows one to dynamically change the task list at any point during the run.
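
In other words, the script pops the first line of the sequence file on every iteration. A simplified Python sketch of this logic (not the actual runtools.py code) is:

def pop_first_line(seq_file):
    """Read the first line of `seq_file`, delete it from the file, and return it."""
    with open(seq_file, 'r') as f:
        lines = f.readlines()
    if not lines:
        return None                       # sequence file is empty: no more work
    with open(seq_file, 'w') as f:
        f.writelines(lines[1:])           # write back everything except the first line
    return lines[0].strip()

while True:
    atmos_file = pop_first_line('RH_SEQUENCE')
    if atmos_file is None:
        break
    # ... run RH 1.5D for every snapshot in atmos_file ...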

Note

The script also includes a tenaciously persistent wait and relaunch feature, designed to avoid corruption if there are system crashes or problems. Besides the tm timeout, if there is any problem with the execution, the script will wait for a period and then try to relaunch the code. For example, if one of the atmosphere files does not exist, runtools.py will try three times and then proceed to the next file.
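
The tm timeout can be pictured as a watchdog on the main log file. The sketch below is a simplified Python illustration of the idea (not the actual runtools.py code): it launches the binary and, if the log has not been updated for tm minutes, kills the run and starts it again.

import os
import time
import subprocess

def run_with_watchdog(cmd, logfile, tm):
    """Run `cmd`, relaunching it whenever `logfile` is idle for more than `tm` minutes."""
    while True:
        with open(logfile, 'w') as log:
            proc = subprocess.Popen(cmd, stdout=log, stderr=log)
            while proc.poll() is None:
                time.sleep(30)
                idle = time.time() - os.path.getmtime(logfile)
                if tm > 0 and idle > tm * 60:
                    proc.kill()           # no log activity: assume the run is hung, relaunch
                    break
            else:
                return proc.returncode    # finished on its own, do not relaunch

# e.g. run_with_watchdog(['mpiexec', './rh15d_ray_pool'], 'rh_running.log', 40)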