
Cloud Numerics - what is it?

Last week, I posted a note on the release of Microsoft's math library for working in the cloud.

( Link to the product itself )

I received many questions about why this library is needed, how it differs from the many others, and how it works, so I decided to write about it in more detail and give more examples. In this post I retell articles previously published by my colleague Ronnie Hoogerwerf and give a simple example. In later posts I plan to show more complex examples of how Cloud Numerics works.

So, “Cloud Numerics” is a new .NET programming framework designed to perform intensive calculations on large distributed data arrays.

This framework consists of:

1. Regular and distributed containers for data arrays
2. Tools for controlling how data is distributed across cluster nodes in the cloud and for running parallel computations on it
3. A wide range of library math functions that can be performed on a set of cluster nodes simultaneously
4. A set of utilities to simplify the deployment and execution of applications built on Cloud Numerics in Windows Azure

Systems built on the Map/Reduce approach (such as Hadoop) were designed to greatly simplify the processing of large data sets. They offer a very simple programming model and a runtime that hides the details of scaling out across huge clusters of commodity compute nodes. This simplified model is adequate for relational operations, clustering algorithms and machine learning on data too large to fit into the combined main memory of all cluster nodes.

However, these approaches are not always optimal when the data does fit into the RAM of the cluster nodes. In addition, algorithms that are iterative by nature, or that are most easily formulated as operations on arrays, are hard to express in programming models like Map/Reduce. Finally, the fast-growing Hadoop ecosystem, which has produced many data analysis and machine learning libraries such as Mahout, Pegasus and HAMA, does not take advantage of mature scalable linear algebra libraries such as PBLAS and ScaLAPACK, which have been optimized and tuned for years.

At the same time, libraries such as the Message Passing Interface (MPI) are ideal for efficiently processing in-memory data on large clusters, but they are extremely difficult to program against. The user of such a library has to manage very carefully how data is transferred between cluster nodes and the parallel processes running on them. If this is not done carefully enough, the resulting “high-performance program” can end up with extremely poor scalability and a high probability of unpredictable failures, hangs and crashes from which recovery is impossible.

The abstractions and interfaces provided by “Cloud Numerics” do not expose any low-level constructs for organizing parallel computation. Parallelism is implicit: it is hidden from the user behind operations on data types such as distributed matrices. These hidden parallel operations keep the code simple and efficient, and they rely on the existing BLAS and ScaLAPACK libraries.

“Hello World” Example

To briefly illustrate the “Cloud Numerics” parallel programming model, here is a C# example that loads a distributed matrix into memory, computes its singular values in parallel, and prints the 2-norm and condition number of the matrix.

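// Load the data into a distributed matrix (csvReader is assumed to be set up earlier).
var A = Distributed.IO.Loader.LoadData(csvReader);
// Compute the singular values of A.
var S = Decompositions.SvdValues(A);
// Take the largest and the smallest singular value.
var s0 = ArrayMath.Max(S);
var s1 = ArrayMath.Min(S);
// The 2-norm is the largest singular value; the condition number is the ratio of the two.
Console.WriteLine("Norm: {0}, Condition Number: {1}", s0, s0 / s1);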
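
For readers who want the math behind the two printed numbers: they follow from the singular values in the standard textbook way. This is general linear algebra, not something specific to Cloud Numerics, and the sigma symbols below are my own notation, corresponding to s0 and s1 in the code.

```latex
\|A\|_2 = \sigma_{\max}(A), \qquad
\kappa_2(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}
```

The condition number is only finite when the smallest singular value is greater than zero. A very large ratio means the matrix is close to singular.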


Source: https://habr.com/ru/post/136953/

