Performance tuning of OMPIO

There are two specific aspects to tuning OMPIO:

Choosing a component in an OMPIO sub-framework

The OMPIO architecture is designed around sub-frameworks, which make it possible to write small, customizable code sections that best serve a particular environment, application, or infrastructure. Although significant effort has been invested into choosing good default values and switching points between components, users and/or system administrators might occasionally want to tune the selection logic and force the utilization of a particular component. The simplest way to force the use of a component is to restrict the list of available components for that framework. For example, an application wanting to use the dynamic fcoll component simply passes the name of that component as the value of the corresponding MCA parameter to mpirun, or uses any other mechanism available in Open MPI to influence a parameter value, e.g.
mpirun --mca fcoll dynamic -np 64 ./myapplication
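
The same selection can be made without modifying the mpirun command line, e.g. through an environment variable. A minimal sketch of that mechanism (the dynamic component is again just an example):

export OMPI_MCA_fcoll=dynamic
mpirun -np 64 ./myapplication
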
FS and FBTL components are typically chosen based on the type of the underlying file system: e.g., the PVFS2 component is chosen when the file is located on a PVFS2 file system, the Lustre component is chosen for Lustre file systems, and so on.
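
Should this automatic selection ever have to be overridden, the same MCA mechanism applies. As an illustration, assuming the generic ufs component is available in your installation, the following would force its use regardless of the underlying file system:

mpirun --mca fs ufs -np 64 ./myapplication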

The FCOLL framework provides four different implementations, which offer different levels of data reorganization across processes. In the order listed here, two_phase, dynamic segmentation, static segmentation, and individual incur decreasing communication costs during the shuffle phase of a collective I/O operation, but also provide decreasing guarantees on the contiguity of the data items before the aggregators read/write data to/from the file. The current decision logic in OMPIO uses the file view provided by the application as well as file system level characteristics (e.g., the stripe width of the file system) when deciding which component to choose. To check which fcoll components are actually available in a given installation before forcing one, the component listing of ompi_info can be filtered, as shown below.
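
ompi_info | grep fcoll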

The SHAREDFP framework provides different implementations of the shared file pointer operations, chosen depending on file system features (specifically, support for file locking), on the locality of the MPI processes in the communicator that was used to open the file, or on guarantees given by the application that only a subset of the available functionality is used (e.g., write operations only).
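
As with the other frameworks, a specific sharedfp component can be requested explicitly. For instance, an application that is known to satisfy the requirements of a particular implementation could select it as follows; the component name individual is an example and the available names depend on your installation:

mpirun --mca sharedfp individual -np 64 ./myapplication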

Setting MCA parameters of OMPIO components

One of the key advantages of the Open MPI (and therefore OMPIO) architecture is the ability to influence the performance of a component through MCA parameters, without having to recompile the library or the application. To retrieve the full list of MCA parameters of a given component, please refer to the man page of the ompi_info tool. As an example, the following command
ompi_info --level 9 --param io ompio
will display all parameters of the ompio component.
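
The same invocation works for any OMPIO sub-framework; passing all instead of a component name queries every component of that framework, e.g.:

ompi_info --level 9 --param fcoll all
ompi_info --level 9 --param fs all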

MCA parameters of the ompio component

The most important parameters influencing the performance of an I/O operation are the ones referenced throughout this document:

io_ompio_cycle_buffer_size: data size threshold at which an individual read or write operation is broken up into multiple internal cycles; its default value of -1 leaves individual operations unsplit.

io_ompio_bytes_per_agg: size of the temporary (collective) buffer used on each aggregator process in collective I/O operations.

io_ompio_num_aggregators: forces the utilization of a given number of aggregator processes in collective I/O operations, overriding the automatic selection logic.

MCA parameters of the fs components

The main parameters of the fs components allow the user to manipulate the layout of a new file on a parallel file system.
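
For illustration, on a Lustre file system the striping of newly created files could be influenced along the following lines; the parameter names fs_lustre_stripe_size and fs_lustre_stripe_width and the values shown are assumptions that should be verified with ompi_info on your installation:

mpirun --mca fs_lustre_stripe_size 1048576 --mca fs_lustre_stripe_width 4 -np 64 ./myapplication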

MCA parameters of the fbtl components

No performance-relevant parameters are available for the fbtl components at this point.

MCA parameters of the fcoll components

The design of the fcoll framework maximizes the utilization of parameters of the ompio component, in order to minimize the number of similar MCA parameters in each fcoll component. For example, the two_phase, dynamic segmentation, and static segmentation components all retrieve the io_ompio_bytes_per_agg parameter to define the collective buffer size, and the io_ompio_num_aggregators parameter to force the utilization of a certain number of aggregators.
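
For example, one could combine a forced fcoll component with a larger collective buffer in a single run; the 32 MB value below is purely illustrative:

mpirun --mca fcoll two_phase --mca io_ompio_bytes_per_agg 33554432 -np 64 ./myapplication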

MCA parameters of the sharedfp components

No performance-relevant parameters are available for the sharedfp components at this point.

Tuning guidelines

The most relevant parameter that can be tuned in advance is the io_ompio_bytes_per_agg parameter of the ompio component. This parameter is essential both for the selection of the collective I/O component and for determining the optimal number of aggregators for a collective I/O operation.

This is a file-system specific value, independent of the application scenario. To determine the correct value on your system, take an I/O benchmark such as IMB or IOR and run an individual, single-process write test. For example, with IMB:

mpirun -np 1 ./IMB-IO S_write_indv
For IMB, use the values obtained for the AGGREGATE test cases. Plot the bandwidth over the message length. The optimal value for io_ompio_bytes_per_agg is the smallest message length that (nearly) saturates the bandwidth of the file system. (Note: make sure that the io_ompio_cycle_buffer_size parameter is set to -1 when running this test, which is its default value.) The value of io_ompio_bytes_per_agg can be set by system administrators in the system-wide Open MPI configuration file, or by users individually; see the FAQ entry on setting MCA parameters for details.
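
For example, a user could persist the value determined with the benchmark above (the 32 MB value shown is illustrative) in the per-user Open MPI parameter file:

echo "io_ompio_bytes_per_agg = 33554432" >> $HOME/.openmpi/mca-params.conf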

For more exhaustive tuning of I/O parameters, we recommend the Open Tool for Parameter Optimization (OTPO), a tool specifically designed to explore the MCA parameter space of Open MPI.