
Data exchange using MPI, with the Intel® MPI Library as an example



In this post, we will discuss how data exchange is organized with MPI, using the Intel MPI Library as an example. We think this information will be useful to anyone who wants a practical introduction to parallel high-performance computing.

We will briefly describe how data exchange is organized in MPI-based parallel applications and point to external sources with more detailed descriptions. The practical part walks through all the stages of developing the "Hello World" demo MPI application, from setting up the environment to running the program itself.

MPI (Message Passing Interface)


MPI is an interface for passing messages between processes that work on a common task. It is intended primarily for distributed-memory systems (MPP), in contrast to, for example, OpenMP. A distributed (cluster) system is typically a set of compute nodes connected by high-performance communication channels (for example, InfiniBand).
MPI is the most widely used interface standard for data exchange in parallel programming; its standardization is handled by the MPI Forum. MPI implementations exist for most modern platforms, operating systems and languages. MPI is widely used for problems in computational physics, pharmaceuticals, materials science, genetics and other fields.

From the MPI point of view, a parallel program is a set of processes running on different compute nodes, each spawned from the same program code.

The key operation in MPI is message passing. MPI implements practically all the basic communication patterns: point-to-point, collective and one-sided.

Working with MPI


Let's walk through a live example of how a typical MPI program works. As a demo application, we take the sample source code supplied with the Intel MPI Library. Before running our first MPI program, we need to prepare and configure a working environment for the experiments.

Setting up a cluster environment


For the experiments, we need a pair of compute nodes (preferably with similar characteristics). If you don't have two servers at hand, you can always use a cloud service.

For the demonstration, I chose Amazon Elastic Compute Cloud (Amazon EC2). Amazon offers new users a free trial year on entry-level servers.

Working with Amazon EC2 is intuitive. If you have questions, you can refer to the detailed documentation (in English). If you wish, you can use any other similar service.

We create two virtual servers. In the management console, select EC2 Virtual Servers in the Cloud, then Launch Instance (an "instance" is an instance of a virtual server).

The next step is choosing the operating system. The Intel MPI Library supports both Linux and Windows; for a first acquaintance with MPI, we'll take Linux. Choose Red Hat Enterprise Linux 6.6 64-bit or SLES 11.3 / 12.0.
Select the Instance Type (server type). For our experiments, t2.micro is enough (1 vCPU, 2.5 GHz, Intel Xeon processor family, 1 GiB of RAM). As a newly registered user, I could use this type for free: it is marked "Free tier eligible". Set Number of instances to 2 (the number of virtual servers).

After clicking Launch Instances (which starts the configured virtual servers), we save the SSH key that we will need to reach the virtual servers from the outside. The status of the virtual servers and the IP addresses for reaching them from the local computer can be monitored in the management console.

An important point: in the Network & Security / Security Groups settings, you need to create a rule that will open ports for TCP connections — this is needed for the MPI process manager. The rule might look like this:
Type: Custom TCP Rule
Protocol: TCP
Port Range: 1024-65535
Source: 0.0.0.0/0
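
If you prefer the command line, the same rule can also be added with the AWS CLI. This is only a sketch: it assumes the CLI is configured and that the instances use a security group named default (substitute your own group name or ID).
$ aws ec2 authorize-security-group-ingress --group-name default \
      --protocol tcp --port 1024-65535 --cidr 0.0.0.0/0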

For security reasons, you can set a stricter rule, but for our demo, this is enough.

Instructions on connecting to the virtual servers from a local computer are available in the Amazon documentation (in English).
From a Windows machine, I used PuTTY to connect to the working servers and WinSCP to transfer files; instructions for setting them up to work with Amazon services are also available (in English).

The next step is to configure SSH. To set up passwordless SSH with public-key authentication, perform the following steps (a command-line sketch follows the list):
  1. On each of the hosts, run the ssh-keygen utility; it creates a private/public key pair in the $HOME/.ssh directory;
  2. Take the contents of the public key (the file with the .pub extension) from one server and append it to the $HOME/.ssh/authorized_keys file on the other server;
  3. Do this for both servers;
  4. Try connecting via SSH from one server to the other and back to verify that SSH is configured correctly. On the first connection, you may need to add the remote host's public key to the $HOME/.ssh/known_hosts list.
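
A minimal command-line sketch of these steps (the key type and file names assume the ssh-keygen defaults; run the first command on both hosts):
$ ssh-keygen -t rsa                      # creates $HOME/.ssh/id_rsa and id_rsa.pub
$ cat $HOME/.ssh/id_rsa.pub              # copy this line to the other host ...
$ echo '<public key line>' >> $HOME/.ssh/authorized_keys   # ... and append it there
$ ssh ip-172-31-47-24 hostname           # check the passwordless connection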

Configuring the Intel MPI Library


So, the working environment is configured. Time to install MPI.
For the demonstration, we take the 30-day trial version of the Intel MPI Library (~300 MB). If you prefer, you can use another MPI implementation, for example MPICH. The latest version of the Intel MPI Library available at the time of writing was 5.0.3.048, and that is the one we use for the experiments.

Install the Intel MPI Library following the built-in installer's prompts (superuser privileges may be required):
$ tar xvfz l_mpi_p_5.0.3.048.tgz
$ cd l_mpi_p_5.0.3.048
$ ./install.sh

Perform the installation on each of the hosts, using the same installation path on both nodes. A more standard way to deploy MPI is to install it on network storage available to every working node, but setting up such storage is beyond the scope of this article, so we restrict ourselves to the simpler option.

To compile the demo MPI program, we use the GNU C compiler (gcc).
It is not included in the standard RHEL image from Amazon, so you need to install it:
$ sudo yum install gcc

As a demo MPI program, take test.c from the standard set of Intel MPI Library examples (located in the intel/impi/5.0.3.048/test folder).
To compile it, the first step is setting up the Intel MPI Library environment:
$ . /home/ec2-user/intel/impi/5.0.3.048/intel64/bin/mpivars.sh

Next, we compile our test program using a wrapper script from the Intel MPI Library (all the necessary MPI dependencies are added automatically during compilation):
$ cd /home/ec2-user/intel/impi/5.0.3.048/test
$ mpicc -o test.exe ./test.c

The resulting test.exe is copied to the second node:
$ scp test.exe ip-172-31-47-24:/home/ec2-user/intel/impi/5.0.3.048/test/

Before starting the MPI program, it is useful to do a test run with some standard Linux utility, for example 'hostname':
$ mpirun -ppn 1 -n 2 -hosts ip-172-31-47-25,ip-172-31-47-24 hostname
ip-172-31-47-25
ip-172-31-47-24

The 'mpirun' utility is a program from the Intel MPI Library for launching MPI applications, a kind of "runner": it is responsible for starting an instance of the MPI program on each of the nodes listed in its arguments.

As for the options: '-ppn' is the number of processes launched per node, '-n' is the total number of processes started, '-hosts' is the list of nodes on which the specified application will be launched, and the last argument is the path to the executable (which may also be an application that does not use MPI).
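
For example, to run four processes, two per node, on the same pair of hosts, the call could look like this (a sketch using the host names from above; each node name should then appear twice in the output):
$ mpirun -ppn 2 -n 4 -hosts ip-172-31-47-25,ip-172-31-47-24 hostname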

In our example with the hostname utility, if we get its output (the compute node name) from both virtual servers, we can conclude that the MPI process manager is working correctly.

"Hello World" using MPI


As a demo MPI application, we took test.c from the standard set of Intel MPI Library examples.

The demo MPI application collects, from each of the running parallel MPI processes, some information about the process and the compute node it is running on, and prints this information on the head node.

Let us consider in more detail the main components of a typical MPI program.

#include "mpi.h" 
Includes the mpi.h header file, which contains declarations of the main MPI functions and constants.
If we use the compiler wrapper scripts from the Intel MPI Library (mpicc, mpiicc, etc.) to compile our application, the path to mpi.h is added automatically. Otherwise, the path to the include folder has to be specified at compile time.
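
For reference, a manual compilation without the wrapper might look roughly like the sketch below. The paths match the user-level installation used in this article, the library name is an assumption that may differ between versions, and the wrapper itself can print the exact command it would run via the MPICH-style 'mpicc -show' option (assuming the Intel wrapper supports it).
$ gcc test.c -o test.exe \
      -I/home/ec2-user/intel/impi/5.0.3.048/intel64/include \
      -L/home/ec2-user/intel/impi/5.0.3.048/intel64/lib -lmpi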

MPI_Init(&argc, &argv);
...
MPI_Finalize();
The MPI_Init() call initializes the MPI execution environment; only after this call can the other MPI functions be used.
The last MPI call in the program is MPI_Finalize(). On successful completion, each running MPI process calls MPI_Finalize(), which cleans up internal MPI resources. Calling any MPI function after MPI_Finalize() is invalid.
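
Put together, the smallest meaningful MPI program is roughly the following sketch (not part of test.c):
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);   /* initialize the MPI execution environment */
    printf("MPI is up\n");    /* other MPI calls are allowed from here on */
    MPI_Finalize();           /* clean up internal MPI resources */
    return 0;
}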

To describe the remaining parts of our MPI program, it is necessary to consider the basic terms used in MPI programming.

An MPI program is a set of processes that can send messages to each other through various MPI functions. Each process has a special identifier, its rank. The rank is used in many MPI message-passing operations; for example, it identifies the recipient of a message.

In addition, MPI has special objects called communicators, which describe groups of processes. Within a single communicator, each process has a unique rank. The same process may belong to several communicators and, accordingly, may have a different rank in each of them. Every data transfer operation in MPI is performed within some communicator. The MPI_COMM_WORLD communicator, which includes all existing processes, is always created by default.
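
To illustrate the last point, the same process can query its rank in two different communicators. A sketch using MPI_Comm_split (the odd/even split rule here is arbitrary and chosen only for illustration):
int world_rank, sub_rank;
MPI_Comm sub_comm;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
/* split MPI_COMM_WORLD into two communicators: even and odd world ranks */
MPI_Comm_split(MPI_COMM_WORLD, world_rank % 2, world_rank, &sub_comm);
MPI_Comm_rank(sub_comm, &sub_rank);   /* same process, generally a different rank */
MPI_Comm_free(&sub_comm);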

Let's go back to test.c:

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size() writes into the size variable the size of the MPI_COMM_WORLD communicator (the total number of processes, which we specified with the mpirun option '-n').
MPI_Comm_rank() writes into the rank variable the rank of the current MPI process within the MPI_COMM_WORLD communicator.

MPI_Get_processor_name(name, &namelen);
The MPI_Get_processor_name() call writes into the name variable a string identifier of the compute node on which the process is running.

The collected information (process rank, size of MPI_COMM_WORLD, processor name) is then sent from all non-zero ranks to rank zero using the MPI_Send() function:
MPI_Send(&rank, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Send(&size, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Send(&namelen, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
MPI_Send(name, namelen + 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);

The MPI_Send() function has the following format:
MPI_Send(buf, count, type, dest, tag, comm)
buf - the address of the memory buffer containing the data to be sent;
count - the number of data elements in the message;
type - the type of the data elements of the message being sent;
dest - the rank of the process that receives the message;
tag - a special tag used to identify messages;
comm - the communicator within which the message is sent.
A more detailed description of MPI_Send() and its arguments, as well as of other MPI functions, can be found in the MPI standard (the documentation is in English).

Rank zero receives the messages sent by the other ranks and prints them to the screen:
 printf ("Hello world: rank %d of %d running on %s\n", rank, size, name); for (i = 1; i < size; i++) { MPI_Recv (&rank, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat); MPI_Recv (&size, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat); MPI_Recv (&namelen, 1, MPI_INT, i, 1, MPI_COMM_WORLD, &stat); MPI_Recv (name, namelen + 1, MPI_CHAR, i, 1, MPI_COMM_WORLD, &stat); printf ("Hello world: rank %d of %d running on %s\n", rank, size, name); } 
For clarity, rank zero additionally prints its own data in the same format as the data received from the remote ranks.

The MPI_Recv() function has the following format:
MPI_Recv(buf, count, type, source, tag, comm, status)
buf, count, type - the memory buffer for the received message;
source - the rank of the process from which the message should be received;
tag - the tag of the expected message;
comm - the communicator within which the data is received;
status - a pointer to a special MPI data structure containing information about the result of the receive operation.
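
For example, after a receive the status can be queried like this (a sketch; stat is the MPI_Status variable that test.c passes to MPI_Recv()):
int received;
MPI_Get_count(&stat, MPI_CHAR, &received);   /* how many elements actually arrived */
printf("received %d chars with tag %d\n", received, stat.MPI_TAG);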

In this article we will not go into the finer points of MPI_Send()/MPI_Recv(); the various kinds of MPI operations and the subtleties of their behavior are a topic for a separate article. We only note that in our program rank zero receives messages from the other processes in a strictly fixed order, starting from rank one and going upwards (this is determined by the source argument of MPI_Recv(), which runs from 1 to size - 1).
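
If this fixed order were not required, the receiving side could accept messages in whatever order they arrive by using the MPI_ANY_SOURCE wildcard and reading the actual sender from the status; a sketch, not what test.c does:
MPI_Recv(&rank, 1, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &stat);
int sender = stat.MPI_SOURCE;   /* the rank that actually sent this message */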

The MPI_Send()/MPI_Recv() functions described above are examples of so-called point-to-point MPI operations, in which one rank exchanges data with another within a particular communicator. There are also collective MPI operations, in which more than two ranks can take part in the data exchange. Collective MPI operations deserve a separate article (possibly more than one).
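
Just to show the shape of a collective operation: the manual gather performed in test.c could, in principle, be expressed with a single call such as MPI_Gather. A hedged sketch, not the approach used in the sample:
int my_rank, all_ranks[2];   /* assumes two processes, as in our runs */
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
/* each rank contributes one int; rank 0 receives them all, ordered by rank */
MPI_Gather(&my_rank, 1, MPI_INT, all_ranks, 1, MPI_INT, 0, MPI_COMM_WORLD);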

Running our demo MPI program, we get:
$ mpirun -ppn 1 -n 2 -hosts ip-172-31-47-25,ip-172-31-47-24 /home/ec2-user/intel/impi/5.0.3.048/test/test.exe
Hello world: rank 0 of 2 running on ip-172-31-47-25
Hello world: rank 1 of 2 running on ip-172-31-47-24


Interested in this topic and would like to take part in developing MPI technology? The Intel MPI Library development team (Nizhny Novgorod) is actively looking for engineers. Additional information can be found on the official Intel website and on the BrainStorage website.

And finally, a small survey about possible topics for future publications on high-performance computing.

Source: https://habr.com/ru/post/251357/

