Background
Here is part of my WRF-Chem namelist:
&time_control
run_days = 147,
&domains
time_step = 72,
time_step_fract_num = 0,
time_step_fract_den = 1,
max_dom = 1,
e_we = 430,
e_sn = 345,
e_vert = 40,
dx = 12000,
dy = 12000,
p_top_requested = 10000,
I used 384 cores to run it and found that the calculation speed was ~5.45 s/step, while writing took ~54.5 s per wrfout file and ~210 s per wrfrst file. The terrible part is the speed of the I/O tasks.
WRF I/O PERFORMANCE
Here’s a summary of the I/O options from the paper "Improving I/O Performance of the Weather Research and Forecast (WRF) Model":
- Serial NetCDF: Default layer.
- Parallel NetCDF: An alternative layer built on the Parallel NetCDF (PNetCDF) library, which supports parallel I/O and works well at lower core counts.
- Quilt Servers: A third technique for writes that uses I/O (or quilt) servers that deal exclusively with I/O, enabling the compute PEs to continue with their work without waiting for data to be written to disk before proceeding.
- Quilt Servers with PNetCDF: An additional technique that combines the I/O server concept with PNetCDF to enable parallel asynchronous writes of WRF history and restart files. This technique proves to be highly advantageous on the Cray XC40 under certain circumstances.
It’s best to read the paper first and then work through this tutorial, which follows the structure above.
Parallel NetCDF
We need to edit namelist.input to make sure PNetCDF works:
io_form_history = 11,
io_form_restart = 11,
io_form_input = 11,
This is the meaning of each one:
Variable Name | Input Option | Description |
---|---|---|
io_form_history | 2 | netCDF |
 | 11 | parallel netCDF |
io_form_restart | 2 | netCDF |
 | 11 | parallel netCDF |
io_form_input | 2 | netCDF |
 | 11 | parallel netCDF |
io_form_boundary | 2 | netCDF |
 | 11 | parallel netCDF |
I don’t know why Input Option = 11 isn’t listed for io_form_restart and io_form_input in the User’s Guide.
Decomposition and Quilting
There are two types of MPI tasks: compute (client) and I/O (server).
Compute tasks
Total number = nproc_x * nproc_y (the numbers of processors along the x and y axes of the domain decomposition).
By default, WRF takes the square root of the number of processors for nproc_x and nproc_y; if that is not possible, it picks two factors that are close to each other. However, this default is not optimal: WRF responds better to a more rectangular decomposition (i.e. nproc_x « nproc_y). This gives longer inner loops for better vector and register reuse, better cache blocking, and a more efficient halo-exchange communication pattern.
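To pick such a decomposition, it helps to list the factor pairs of the compute-rank count. This small helper is just a sketch of that idea (the function name and the 374-rank example are mine, matching the runs below):

```python
# Sketch: enumerate rectangular decompositions nproc_x * nproc_y for a
# given number of compute ranks, with nproc_x <= nproc_y so you can pick
# a shape where nproc_x is much smaller than nproc_y.
def decompositions(n_compute):
    """Return all (nproc_x, nproc_y) factor pairs with nproc_x <= nproc_y."""
    pairs = []
    for x in range(1, int(n_compute ** 0.5) + 1):
        if n_compute % x == 0:
            pairs.append((x, n_compute // x))
    return pairs

# 374 compute ranks (384 cores minus 10 quilt servers):
print(decompositions(374))  # [(1, 374), (2, 187), (11, 34), (17, 22)]
```

Of the usable pairs, 11*34 is the most rectangular one that is not degenerate, which is why it appears in the tests below.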
I/O tasks
Quilting sets aside one or more ranks (known as quilt or I/O servers) to deal exclusively with I/O: once the compute (client) ranks have sent their data to these I/O servers, they can continue with their work while the data is formatted and written to disk in the background (asynchronously).
Whether or not this technique is appropriate depends on the amount of output time taken by PNetCDF and the number of compute ranks being used, since it can be inefficient to dedicate too high a proportion of ranks to I/O only.
WRF attempts to match each I/O server with compute tasks along east-west rows, and ideally (though this is not mandatory) nproc_y should be an exact multiple of nio_tasks_per_group.
Total number = nio_groups * nio_tasks_per_group
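As a sketch, a 2*5 layout (2 groups of 5 I/O tasks, as used in my later runs) would be requested in namelist.input like this; the values are from my tests, not a general recommendation:

```fortran
&namelist_quilt
 nio_tasks_per_group = 5,
 nio_groups = 2,
/
```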
Patch and tile
numtiles is the number of tiles per patch. Tiling has the greatest effect at lower processor counts, when the patches do not fit into cache.
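It is set in the &domains section; the value 4 below is purely illustrative, not a tuned recommendation:

```fortran
&domains
 numtiles = 4,
/
```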
Tests
Without PNetCDF
Number of cores | MPI tasks | nproc_x * nproc_y | nio_groups * nio_tasks_per_group | Speed of calculation (s/step) | Speed of writing (s/file) |
---|---|---|---|---|---|
384 | 384 | 16*24 | none | 5.45 | 54.5 (wrfout), 210 (wrfrst) |
384 | 374 | 11*34 | 5*2 | 5.5 | 1.37 (wrfout) |
336 | 312 | 13*24 | 12*2 | 6.6 | 1.29 (wrfout) |
336 | 312 | 13*24 | 6*4 | 6.6 | 1.19 (wrfout), 1.37 (wrfrst) |
384 | 374 | 11*34 | 2*5 | 5.5 | 0.86 (wrfout), 1.78 (wrfrst) |
Number of cores = nproc_x * nproc_y + nio_groups * nio_tasks_per_group
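This arithmetic can be sanity-checked with a trivial script (a sketch; the settings are the ones from the table above):

```python
# Sketch: total MPI ranks = compute (client) ranks + quilt (I/O server) ranks.
def total_cores(nproc_x, nproc_y, nio_groups, nio_tasks_per_group):
    """Total number of cores for a given decomposition and quilting layout."""
    return nproc_x * nproc_y + nio_groups * nio_tasks_per_group

print(total_cores(11, 34, 2, 5))   # 384
print(total_cores(13, 24, 6, 4))   # 336
```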
When I enabled quilting, I got this error with the first quilting setting (5*2):
FATAL CALLED FROM FILE: <stdin> LINE: 676
Possible 32-bit overflow on output server. Try larger nio_tasks_per_group in namelist.
According to this PDF, since the I/O servers gather data from many compute ranks, they require more memory than the compute ranks (generally a requirement similar to that of the serial rank-0 collector), and so they cannot be fully packed onto nodes with large numbers of cores.
Following the error message, I needed the larger nio_tasks_per_group mentioned in that PDF, so I tested the third setting (12*2). However, it still failed. After changing the layout to 6*4, it worked. I guess WRF-Chem needs more I/O tasks per group than WRF to write the wrfrst* files.
So, I chose the last setting (2*5), which meets both conditions.
With PNetCDF
With PNetCDF enabled, the run stalled at "med_initialdata_input: calling input_input", so there are no timings yet:
Number of cores | MPI tasks | nproc_x * nproc_y | nio_groups * nio_tasks_per_group | Speed of calculation (s/step) | Speed of writing (s/file) |
---|---|---|---|---|---|
384 | 374 | 11*34 | 2*5 | ?? | ?? |