Roughly speaking, a parallel machine is a set of processors, memory, and some means of communication between them. It may be the dual-core processor in your (no longer new) laptop, a multiprocessor server, or, say, a cluster (supercomputer). You may know nothing about such computers, but you know exactly why they are built: speed, speed, and speed again. However, speed is not their only advantage.
After completing the far-from-trivial task of building such a machine, designers and developers still have to figure out how to make it do useful work. After all, the techniques and algorithms used on the old single-processor, single-threaded machines are, as a rule, not suitable.
What is most surprising is that universities are in no hurry to shift their curricula toward parallel computing, even though today you would be hard pressed to find a computer with a single core. At my own Carleton University, parallel computing courses are not part of the mandatory Bachelor of Computer Science program and are only available to those who have completed the core courses of the first three years. Courses on distributed computing sit at the same level, and the two are easy to confuse.
So what is the difference? Parallel computing involves hardware that usually sits in one physical location, is tightly interconnected, and whose operating parameters are all known to the programmer. In distributed computing there is no tight, constant connection between the nodes; as the name suggests, they are spread out over some territory, and the operating parameters of the system are dynamic and not always known.
The classical algorithms taught for the last half century are no longer suitable for parallel systems. As you might guess, an algorithm for a parallel machine is called a parallel algorithm. It often depends heavily on the architecture of a particular machine and is not as universal as its classical ancestor.
You may ask: why was it necessary to invent a new architecture and then puzzle over new algorithms; couldn't we just keep accelerating conventional computers? First, there is currently no way to build a single super-fast processor comparable to a modern parallel supercomputer, and even if there were, it would take enormous resources. Second, many problems are solved well by parallel machines precisely because of this structure! Computing complex chaotic systems such as the weather, simulating interactions of elementary particles in physics, modeling at the nano level (yes, yes, nanotechnology!), data mining (about which we have a blog on this site), cryptography... the list goes on and on.
For the programmer, as the end user of a parallel machine, two scenarios are possible: the parallelism is either visible to them or transparent. In the first case, the programmer writes programs with the computer's architecture in mind and takes the parameters of that particular machine into account. In the second, the programmer may not even know that a non-classical computer sits in front of them, and all the classical methods keep working thanks to the foresight of the system's designers.
Here is an example of a multi-core processor: the SUN UltraSPARC T1/T2, beloved by many:

8 cores, each supporting 4 (on the T1) or 8 (on the T2) hardware threads, which by simple multiplication gives us 32/64 threads per processor, respectively. In theory, this means such a processor can perform 32/64 tasks of a single-threaded processor in the same time. But, of course, plenty of resources are spent on switching between threads and other internal housekeeping.
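As a quick aside, you can ask the operating system how many hardware threads it actually sees. A minimal sketch in C for POSIX-like systems (the _SC_NPROCESSORS_ONLN query is a widespread extension available on Linux and Solaris; on a T2 one would expect it to report 64):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Number of logical processors currently online;
       each hardware thread counts as one. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    printf("hardware threads visible to the OS: %ld\n", n);
    return 0;
}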
And here, for example, are the famously cool graphics units, like the nVidia GeForce 9600 GT with 64 cores on board.

The latest GPUs have up to a thousand cores! We will talk about why graphics units have so many cores a bit later. For now, take a look at this example of the parallelism hierarchy in supercomputers:

Clearly, gathering a bunch of powerful computers is not the problem. The problem is making them work together, that is, connecting them with a fast network. Such clusters often use a high-speed switch, with all the computers simply attached to a fast local area network. Here, for example, is the cluster at our university:

256 cores: 64 nodes, each with 4 cores at 2.2 GHz and 8 GB of RAM. They are connected through a Foundry SuperX switch over a gigabit network, and the OS is Rocks Linux. The switch is often the most expensive element: try finding a fast 64-port switch! This illustrates one of the biggest problems of parallel computers: the speed of information exchange. Processors have become very fast, while memory and buses are still slow. Not to mention hard drives; next to processors they look like Stone Age tools!
Classification
Let's finally sort out the types of parallel machines. They differ in the type of memory, shared or distributed, and in the type of control, SIMD or MIMD. That gives four types of parallel machines.
Shared memory
A classic computer looks like this:

And the most obvious way to extend this scheme to several processors is this:

Instead of one processor there are now several, and there is correspondingly more memory, but overall it is the same scheme. It turns out that all processors share common memory, and if processor 2 needs the information processor 3 is working on, it will receive it from shared memory at the same speed.
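To make the model concrete, here is a minimal sketch of shared-memory parallelism in C with OpenMP (assuming a compiler with OpenMP support, e.g. gcc -fopenmp): all threads read the same array directly from common memory, and no explicit data transfer is needed.

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];
    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    double sum = 0.0;
    /* All threads work on the same array in shared memory;
       the reduction combines their partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}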
Here’s what a Quad Pentium built like this looks like:

Distributed memory
As you have probably guessed, in this case each processor has its own memory (bear in mind, this is not about the processor's internal memory).

An example is the cluster described above: it is essentially a bunch of separate computers, each with its own memory. In this case, if processor (or computer) 2 needs information from computer 3, it will take longer: the information must be requested and transferred (in the case of a cluster, over the local network). The speed of this exchange depends on the network topology, which is why various kinds of structures have been developed.
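That explicit request-and-transfer is usually expressed as message passing. Here is a minimal sketch using MPI (assuming an MPI installation; compile with mpicc, launch with mpirun -np 2), with rank 0 standing in for the computer that owns the data and rank 1 for the one that needs it:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* This node owns the data and sends it over the network. */
        double data = 42.0;
        MPI_Send(&data, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* This node waits for the message to arrive; the latency
           depends on the network between the nodes. */
        double data;
        MPI_Recv(&data, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %f from rank 0\n", data);
    }

    MPI_Finalize();
    return 0;
}

How long that MPI_Recv waits depends on how the nodes are wired together, which brings us to topologies.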
The simplest option that comes to mind is a two-dimensional grid (mesh):

Or a cube:

IBM Blue Gene, a descendant of the famous IBM Deep Blue that defeated a human at chess, has a similar structure. Similar, because Blue Gene is actually not a cube but a torus: the outermost elements are connected to each other:

By the way, it is called Gene because it is actively used in genetics research.
Another interesting structure that was bound to occur to someone is everyone's favorite tree:

Since the depth (or height) of a binary tree is log n, transmitting information between the two most distant nodes covers a distance of 2 log n, which is very good; yet such a structure is still rarely used. First, to split such a network into two isolated subnets, it is enough to cut a single wire (remember the min-cut problem?), whereas in a two-dimensional grid you would have to cut sqrt(n) wires! Work out the cube yourself. And second, too much traffic passes through the root node!
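If you want to convince yourself of the 2 log n bound, note that the path between two nodes climbs to their lowest common ancestor and descends back down. A tiny sketch in C, assuming nodes are numbered in heap order (root = 1, parent of v = v/2):

#include <stdio.h>

/* Hops between two nodes of a complete binary tree,
   with nodes numbered in heap order (root = 1, parent of v = v/2). */
int tree_distance(unsigned a, unsigned b) {
    int hops = 0;
    while (a != b) {
        if (a > b) a /= 2;   /* climb from the deeper node */
        else       b /= 2;
        hops++;
    }
    return hops;
}

int main(void) {
    /* The two most distant leaves of a 7-node tree of height 2:
       distance 2 * 2 = 4. */
    printf("%d\n", tree_distance(4, 7));  /* prints 4 */
    return 0;
}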
In the 80s, four-dimensional hypercubes were popular:

These are two three-dimensional cubes with their corresponding vertices connected. Cubes of even higher dimension were also built, but they are hardly used now, partly because of the enormous number of wires!
In general, designing a network to fit a problem is an interesting topic in itself. Here, for example, is the so-called Omega network:

With memory sorted out, let's turn to control.
SIMD vs. MIMD
SIMD - Single Instruction stream, Multiple Data stream. There is a single control node that sends instructions to all the other processors. Each processor has its own set of data to work on.
MIMD - Multiple Instruction stream, Multiple Data stream. Each processor has its own control unit, and each can execute different instructions.
SIMD systems are usually used for specific tasks that call not so much for flexibility and versatility as for raw computational power: media processing, scientific computing (the same simulations and modeling), or, say, Fourier transforms of giant matrices. That is why graphics units have such a mad number of cores: they are SIMD systems, and the truly “smart” processor (like the one in your computer) is usually just one; it manages a bunch of simple, non-universal cores.

Since the “controlling” processor sends the same instructions to all the “working” processors, programming such systems requires some dexterity. Here is a simple example:
if (B == 0)
    then C = A
    else C = A / B
The initial state of the processors' memory is:

The first line has executed and the data has been read; now the second line, the then branch, runs:

Here the second and third processors do nothing, because their value of B does not satisfy the condition. Accordingly, the third line executes next, and this time the other two processors “rest”:

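The same lock-step behavior can be modeled in plain C by treating array index i as processor number i: the machine walks through both branches, and a mask decides which “processors” actually write. This is only a sketch of the idea (with made-up values for A and B), not of any real GPU instruction set:

#include <stdio.h>

#define P 4  /* number of SIMD "processors" */

int main(void) {
    double A[P] = {6.0, 8.0, 9.0, 5.0};  /* hypothetical data */
    double B[P] = {0.0, 2.0, 3.0, 0.0};
    double C[P];

    /* Step 1: every processor evaluates the condition. */
    int mask[P];
    for (int i = 0; i < P; i++)
        mask[i] = (B[i] == 0);

    /* Step 2 ("then"): only processors with the mask set do work. */
    for (int i = 0; i < P; i++)
        if (mask[i]) C[i] = A[i];

    /* Step 3 ("else"): the rest work while the others idle. */
    for (int i = 0; i < P; i++)
        if (!mask[i]) C[i] = A[i] / B[i];

    for (int i = 0; i < P; i++)
        printf("C[%d] = %f\n", i, C[i]);
    return 0;
}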
Examples of SIMD machines are the old ILLIAC IV, MPP, DAP, the CM-1/2 Connection Machine, modern vector units, certain co-processors, and graphics units like nVidia GPUs.
MIMD machines have broader functionality, which is why they are what our consumer computers use. If your laptop has at least a dual-core processor, you are the lucky owner of a shared-memory MIMD machine! Distributed-memory MIMD means supercomputers like the IBM Blue Gene we talked about above, or clusters.
Here is the whole classification summed up:

With that, the introduction to the topic can be considered complete. Next time we will talk about how the speed of parallel machines is measured, write our first parallel program using recursion, run it on a small cluster, and learn how to analyze its speed and resource consumption.