This tutorial will indicate how to perform GS calculations on hundreds processors using ABINIT.
You will learn how to use some keywords related to the "KGB"
parallelization scheme. "K" stands for "k-point", "G" refers to the wavevector of a planewave, and "B" stands for a "Band".
On one hand, you will see how to improve the
speed-up of your calculations and, on the other hand, how to increase
the convergence speed of a self consistent field calculation. 
This lesson should take about 1.5 hour and requires
      access to at least a 200 CPU core parallel computer.  
You are supposed to know already some basics of parallelism in ABINIT, explained in the tutorial A first introduction to ABINIT in parallel.
On the contrary, the present tutorial is not a prerequisite for other tutorials about parallelism in ABINIT.
Before continuing you might work in a different subdirectory, as for the other lessons. Why not "work_paral_gspw" ? All the input files can be found in the ~abinit/tests/tutoparal/Input directory. You might have to adapt them to the path of the directory in which you have decided to perform your runs. You can compare your results with reference output files located in ~abinit/tests/tutoparal/Refs.
In the following, when "run ABINIT over nn CPU cores" appears, you have to use a specific command line, according to the operating system and architecture of the computer you are using. This can be, for instance,
mpirun -n nn abinit < abinit.files
or the use of a specific
      submission file. Some scripts are given as examples in the directory ~abinit/doc/tutorial/lesson_paral_gspw/. 
      You can adapt them to your own calculations.
When the size of the system increases up to 100 or 1000 atoms, it is usually impossible to perform ab initio calculations on a single computing core. This is because the basis sets used to solve the problem (PWs, bands, ...) increase proportionnally (linearly, as the square, or even exponentially...). The trouble has two origins: (i) the memory, with the amount of data stored in RAM and (ii) the computation time, with specific strong bottlenecks which are for example the eigensolver, the 3dim-FFTs...
Therefore, it is generally mandatory to adopt a parallelization strategy. That is to say: (i) to distribute the data in a MPI-sense on a large number of processors, and (ii) to parallelize the routines responsible for the increase of the computation time.
In this tutorial, we will go beyond the simple k-point and spin
parallelism explained in the tutorial "A first introduction to ABINIT in parallel" 
and show: 
The tests are performed on a system with 108 atoms of gold; this is a benchmark used for a long time during the development and the implementation of the KGB parallelization. With respect to the input file used for the benchmark, the cutoff energy is strongly reduced in this tutorial, for practical reasons. For real tests, you can see the results (in particular the scaling) in:
One of the most simple way to launch the KGB parallelization in ABINIT is to add
just one input variable to the sequential input file. 
This is paral_kgb and controls everything concerning the KGB parallelization: 
the use of the LOBPCG eigensolver (wfoptalg=4 or 14) of A. Knyazev, the parallel 3dim-FFT (fftalg=401) written by S. Goedecker,
and some other tricks... At the time of writing (but not for much longer), you
still have to define the number or processors needed on each level of
the KGB parallelization: npband, npfft and npkpt. 
If ABINIT is not yet able to handle directly the number of cores and
to launch from scratch an efficient distribution of processors, you can already use
this parallelization as a black box. It is possible to get a good
estimation of the most efficient processor distribution by performing
a simple sequential run. This is done by simply adding to the input file
(in addition to paral_kgb=1)
autoparal=1 and
max_ncpus="the maximum number of processors you want".
Note that max_ncpus is a new variable introduced in Abinit version 7.6.2. Prior to this version, one had to use
paral_kgb=-"the maximum number of processors you want"
to specify the maximum number of processors.
 In order to do that, 
copy the file tests/tutoparal/Input/tgspw_01.in and the related tgspw_01.files file in your working directory. 
Then run ABINIT on 1 CPU core.
At the end of the log file tgspw_01.log, you will see:
npimage| npkpt| npspinor| npfft| npband| bandpp | nproc| weight|A weight is assigned to each distribution of processors. As indicated in the documentation, you are advised to select a processor distribution with a higher weight. If we just focus on npband and npfft, we see that for 108 processors the recommended distribution is are (12x9).
1 -> 1| 1 -> 1| 1 -> 1| 1 -> 22| 1 -> 108| 1 -> 65| 8 -> 108| 1 -> 108|
1| 1| 1| 12| 9| 1| 108| 55.13 |
1| 1| 1| 10| 9| 1| 90| 54.98 |
1| 1| 1| 9| 9| 1| 81| 54.85 |
1| 1| 1| 9| 12| 1| 108| 54.37 |
1| 1| 1| 12| 9| 2| 108| 52.86 |
......................
+ npband 108 npfft 1 
A second one with another distribution:
 - npband 108 npfft 1 
 + npband  54 npfft 2 
 
And so on... Alternatively, this can be performed using a shell script including: 
>> cp tgspw_02.in tmp.fileBy reference to the couple (npbandxnpfft), all these results are named: tgspw_02.108-01.out, tgspw_02.054-02.out, tgspw_02.036-03.out, tgspw_02.027-04.out, tgspw_02.018-06.out, tgspw_02.012-09.out and tgspw_02.009-12.out. The timing of each calculation can be retrieved using the shell command:
>> echo "npband 108 npfft 1" >> tgspw_02.in
>> mpirun -n 108 abinit < tgspw_02.files
>> cp tgspw_02.out tgspw_02.108-01.out
>> cp tmp.file tgspw_02.in
>> echo "npband 54 npfft 2" >> tgspw_02.in
>> ...
>> grep Proc *out
>> tgspw_02.009-12.out:- Proc. 0 individual time (sec): cpu= 88.3 wall= 88.3
>> tgspw_02.012-09.out:- Proc. 0 individual time (sec): cpu= 75.2 wall= 75.2
>> tgspw_02.018-06.out:- Proc. 0 individual time (sec): cpu= 63.7 wall= 63.7
>> tgspw_02.027-04.out:- Proc. 0 individual time (sec): cpu= 69.9 wall= 69.9
>> tgspw_02.036-03.out:- Proc. 0 individual time (sec): cpu= 116.0 wall= 116.0
>> tgspw_02.054-02.out:- Proc. 0 individual time (sec): cpu= 104.7 wall= 104.7
>> tgspw_02.108-01.out:- Proc. 0 individual time (sec): cpu= 141.5 wall= 141.5
As far as the timing is concerned, the best distributions are then the ones proposed above in section 1.: that is to say the couples (18x6) and (27x4). So the prediction was pretty good.
>> grep "ETOT 10" *.outThe last column indicates the convergence of the density (or potential) residual. You can see that this quantity is the smallest when npband is the highest. This result is expected: the convergence is better when the size of the block is as large as possible, so need for re-orthogonalization among band processor pools is minimized. But the best convergence is obtained for the (108x1) distribution... when the worst timing is measured.
>> tgspw_02.009-12.out: ETOT 10 -3754.4454784191 -1.549E-03 7.222E-05 1.394E+00
>> tgspw_02.012-09.out: ETOT 10 -3754.4458434046 -7.875E-04 6.680E-05 2.596E-01
>> tgspw_02.018-06.out: ETOT 10 -3754.4457793663 -1.319E-03 1.230E-04 6.962E-01
>> tgspw_02.027-04.out: ETOT 10 -3754.4459048995 -1.127E-03 1.191E-04 5.701E-01
>> tgspw_02.036-03.out: ETOT 10 -3754.4460493339 -1.529E-03 7.121E-05 3.144E-01
>> tgspw_02.054-02.out: ETOT 10 -3754.4460393029 -1.646E-03 7.096E-05 7.284E-01
>> tgspw_02.108-01.out: ETOT 10 -3754.4464631635 -6.162E-05 2.151E-05 7.457E-02
- nstep 10You can copy the input file tgspw_03.in then run ABINIT over 108 cores with tgspw_02.027-04.out, bandpp=1 and bandpp=2. The output files will be named tgspw_03.bandpp1.out and tgspw_03.bandpp2.out, respectively. A comparison between these two files shows that the convergence is better in the second case. The convergence is even achieved before the input file limit of 20 electronic iterations. Quod Erat Demonstrandum:
+ nstep 20
+ npband 27 npfft 4
- npband 27 npfft 4Then run ABINIT over 108 cores and copy the output file to tgspw_03.054-02.out. Perform a diff (with vimdiff for example) between the two output files tgspw_03.bandpp1.out and tgspw_03.054-02.out. You can convince yourself that the two calculations (54x2) with bandpp=1 and (27x4) with bandpp=2, give exactly the same convergence. This result is expected, since the sizes of the block are equal (to 54) and the number of FFT processors npfft does not affect the convergence.
+ npband 54 npfft 2 bandpp 1
It is possible to modify the distribution of processors,
without changing the convergence, by reducing npband and increasing bandpp proportionally. 
In the previous section, we showed that the convergence doesn't change if bandpp and npband change in inverse proportions. What about the influence of bandpp if you fix the distribution? Two cases have to be treated separately.
You can see, in the previous calculations of section 3., that the timing increases when bandpp increases:
>> grep Proc tgspw_03.bandpp1.out tgspw_03.bandpp2.out
>> tgspw_03.bandpp1.out:- Proc. 0 individual time (sec): cpu= 121.4 wall= 121.4
>> tgspw_03.bandpp2.out:- Proc. 0 individual time (sec): cpu= 150.7 wall= 150.7
while there are fewer electronic iterations for bandpp=2 (19) than for bandpp=1 (20). If you perform a diff between these two files, you will see that the increase in time is essentially due to the section "zheegv-dsyegv".
>> grep "zheegv-dsyegv" tgspw_03.bandpp1.out tgspw_03.bandpp2.out
>> tgspw_03.bandpp1.out:- zheegv-dsyegv 1321.797 10.1 1323.215 10.1 513216
>> tgspw_03.bandpp2.out:- zheegv-dsyegv 5166.002 31.8 5164.574 31.8 244944
The "zheegv-dsyegv" is a part of the LOBPCG algorithm which is performed in sequential, so the same calculation is
done on each processor. In the second calculation, the size
of the block being larger (27x2=54) than in the first (27), the
computational time of this diagonalization is more expensive. To sum up, the timing of a single electronic
step increases by increasing bandpp, but the convergence improves.
>> grep Proc *outyou can see that the timing decreases for bandpp=2 and increases thereafter. This behaviour comes from the FFTs. For bandpp=2, the real wavefunctions are associated in pairs in the complex FFTs, leading to a reduction by a factor of 2 of the timing in this part (you can see this reduction by a diff of the output files). Above bandpp=2, there is longer any gain in the FFTs, whereas some significant losses in computational time appear in the "zheegv-dsyegv" section.
>> tgspw_04.bandpp1.out:- Proc. 0 individual time (sec): cpu= 61.4 wall= 61.4
>> tgspw_04.bandpp2.out:- Proc. 0 individual time (sec): cpu= 49.0 wall= 49.0
>> tgspw_04.bandpp4.out:- Proc. 0 individual time (sec): cpu= 62.5 wall= 62.5
>> tgspw_04.bandpp6.out:- Proc. 0 individual time (sec): cpu= 75.3 wall= 75.3
Up to now, we only performed a GB parallelization. This implies
parallelization over 2 levels of PWs or over 2 levels of bands and FFTs,
for different sections of the code (see the paper
or presentation). If the system
has more than 1 k-point,
one can add a third level of parallelization and perform a real KBG
parallelization. There is no additional difficulty in adding 
processors on this level. In order to explain the procedure, we
restart with the same input file that was used in section 2.
and add a denser M-P grid (see the input
file tgspw_05.in). In
this case, the system has 4 k-points in
the IBZ so the calculation can be parallelized over (at most) 4 k-point processors. This is done using the
npkpt input
variable:
 
+ npkpt 4
This implies we use four times more processors than before, so run ABINIT
      over 432 CPU cores. The timing obtained in the output file tgspw_05.out:
>> grep Proc tgspw_05.out
>> - Proc. 0 individual time (sec): cpu= 87.3 wall= 87.3
is quasi-identical to the one obtained for 1 k-point
(69.9 sec, see the output file tgspw_02.027-04.out.
This means that a calculation 4 times larger (due to an increase of the number of k-points)
gives approximatively the same human time if you parallelize over all the k-points. 
You have just re-derived a well established result: the scaling (the speedup) is quasi-linear on the
k-point level.
When you want
to parallelize a calculation, begin by the k-point level, then follow
with the band level (up to npband=50
typically) then finish by the FFT level.
Here, the timing obtained for the output tgspw_05.out leads
to a hypothetical speedup of 346, which is good, but not 432 as
expected if the scaling was linear as a function of the number of the  k-point
processors. Indeed, in order to be comprehensive, we have to mention
that the timing obtained in this output is slightly longer (17 sec.
more) than the one obtained in tgspw_02.027-04.out. 
Compare the time spent in all the routines. A first clue
comes from the timing done below the "vtowfk level", which contains
99.xyz% of sequential processor time: 
>> grep "vtowfk " tgspw_05.out tgspw_02.027-04.outWe see that the KGB parallelization performs really well, since the human time spent within vtowfk is approximatively equal: 26409/432 ~ 6372/108. So, the speedup is quasi-linear below vtowfk. The problem comes from parts above vtowfk which are not parallelized and are responsible for the negligible (1-99.xyz)% of time spent in sequential. These parts are no longer negligible when you parallelize over hundreds of processors.
>> tgspw_05.out:- vtowfk 26409.565 70.7 26409.553 70.7 4320
>> tgspw_05.out:- vtowfk 26409.565 70.7 26409.553 70.7 4320
>> tgspw_02.027-04.out:- vtowfk 6372.940 84.8 6372.958 84.8 1080
>> tgspw_02.027-04.out:- vtowfk 6372.940 84.8 6372.958 84.8 1080
This behaviour is related to the Amdhal's law: "The speedup of a program using multiple processors in parallel computing is limited by the time needed for the sequential fraction of the program. For example, if a program needs 20 hours using a single processor core, and a particular portion of 1 hour cannot be parallelized, while the remaining promising portion of 19 hours (95%) can be parallelized, then regardless of how many processors we devote to a parallelized execution of this program, the minimum execution time cannot be less than that critical 1 hour. Hence the speedup is limited up to 20."
In our case, the part above the loop over k-point
in not parallelized by the KGB parallelization. Even if this part is
very small, less than 1%, when the biggest part below is strongly
(perfectly) parallelized, this small part determines an upper bound for the speedup.