Microsoft Azure ❤ Big Data

About six months ago I published a retrospective of what was happening in the Microsoft Azure cloud that might interest researchers.

I will continue this topic, shifting the focus slightly to the areas of IT that have consistently interested me most over the last couple of years: Big Data, machine learning, and their symbiosis with cloud technologies.

Below I mainly discuss the October announcements of Microsoft Azure services that provide batch and real-time processing of large data sets, a high-performance on-demand cluster, and broad support for machine learning algorithms.

Real time

Apache Storm in HDInsight

At the Strata + Hadoop World conference, held in October of this year (2014), Apache Storm support was announced in HDInsight (a PaaS service that provides Hadoop on demand).

Apache Storm is a high-performance, scalable, fault-tolerant framework for distributed application execution in both near real-time and batch mode.

Cloud-hosted Service Bus Queues or Event Hubs can serve as data sources for Storm in HDInsight.

At the TechEd Europe 2014 conference, held in late October, Cloudera Enterprise and Hortonworks Data Platform were announced as available on Azure in the form of pre-configured Azure VMs (an IaaS service).

Azure Event Hubs

Event Hubs, a highly scalable service capable of processing millions of requests per second in near real time, became generally available.

The main features of the service (besides those already mentioned in the definition):

incoming throughput: > 1 GB per second;
number of event producers: > 1 million;
HTTP(S) and AMQP protocol support;
elastic scaling up and down without downtime;
time-based event buffering that preserves ordering;
limits that depend on the plan.
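
To make "time-based buffering that preserves ordering" a bit more tangible, here is a toy in-memory model in Python. It is only my illustration of the idea, not how Event Hubs is implemented: the class `TimeBufferedStream`, its methods, and the 60-second retention are all made up for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Any, Deque, List


@dataclass(frozen=True)
class BufferedEvent:
    offset: int          # position in the stream, assigned on arrival
    enqueued_at: float   # arrival time in seconds
    body: Any


class TimeBufferedStream:
    """Hypothetical model: keeps events in arrival order for a fixed retention period."""

    def __init__(self, retention_seconds: float) -> None:
        self.retention_seconds = retention_seconds
        self._events: Deque[BufferedEvent] = deque()
        self._next_offset = 0

    def append(self, body: Any, now: float) -> int:
        """Store an event, evict expired ones, and return the new event's offset."""
        self._evict(now)
        event = BufferedEvent(self._next_offset, now, body)
        self._events.append(event)
        self._next_offset += 1
        return event.offset

    def read_from(self, offset: int, now: float) -> List[BufferedEvent]:
        """Return the still-retained events with offset >= `offset`, oldest first."""
        self._evict(now)
        return [e for e in self._events if e.offset >= offset]

    def _evict(self, now: float) -> None:
        # Drop events older than the retention period from the front of the queue.
        while self._events and now - self._events[0].enqueued_at > self.retention_seconds:
            self._events.popleft()


stream = TimeBufferedStream(retention_seconds=60)
stream.append("order-1", now=0)
stream.append("order-2", now=30)
stream.append("order-3", now=70)  # "order-1" is 70 s old by now and is evicted
bodies = [e.body for e in stream.read_from(0, now=70)]
```

A consumer that remembers the last offset it has seen can come back later and continue reading in the original order, as long as the events it needs are still within the retention period.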

Azure Stream Analytics

Stream Analytics is an event-processing engine that lets you process a large number of events in real time. As befits a decent cloud service, Stream Analytics copes with a load of more than 1 million requests per second, scales elastically, and supports several data sources: Azure Blob Storage and Event Hubs. Transformation rules in Stream Analytics are written in a SQL-like language (yes, yet another SQL-like query language).

To work with the service, you follow steps similar to those in Azure Data Factory (more on it a little later): create the service itself, define the input data sources and output streams through the web interface, and write a query for the data transformation.

This is what a query that builds one-second candles from buy/sell orders for shares arriving from an abstract exchange looks like:

    SELECT
        DateAdd(second, -1, System.Timestamp) AS OpenTime,
        System.Timestamp AS CloseTime,
        Security,
        Max(Price) AS High,
        Min(Price) AS Low,
        Sum(Volume) AS TotalVolume
    FROM trades
    GROUP BY TumblingWindow(second, 1), Security
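
For readers who do not think in SQL, roughly the same aggregation can be sketched in plain Python. This is an illustration under assumptions, not the Stream Analytics engine: the `Trade` and `Candle` types and the `one_second_candles` function are hypothetical, a trade at time t is assigned to the window [floor(t), floor(t) + 1) (the service may treat window boundaries differently), and the volume is summed, which is what the name TotalVolume suggests.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Trade(NamedTuple):
    timestamp: float  # seconds since an arbitrary epoch
    security: str
    price: float
    volume: int


class Candle(NamedTuple):
    open_time: int
    close_time: int
    security: str
    high: float
    low: float
    total_volume: int


def one_second_candles(trades: Iterable[Trade]) -> List[Candle]:
    """Group trades into non-overlapping one-second windows per security."""
    groups: Dict[Tuple[int, str], List[Trade]] = defaultdict(list)
    for trade in trades:
        window_start = int(trade.timestamp // 1)  # whole second the trade falls into
        groups[(window_start, trade.security)].append(trade)

    candles = []
    for (window_start, security), bucket in sorted(groups.items()):
        candles.append(
            Candle(
                open_time=window_start,
                close_time=window_start + 1,
                security=security,
                high=max(t.price for t in bucket),
                low=min(t.price for t in bucket),
                total_volume=sum(t.volume for t in bucket),
            )
        )
    return candles


sample_trades = [
    Trade(0.2, "MSFT", 46.0, 100),
    Trade(0.7, "MSFT", 46.5, 50),
    Trade(0.9, "AAPL", 105.0, 5),
    Trade(1.1, "MSFT", 46.2, 10),
]
candles = one_second_candles(sample_trades)
```

Each window is closed and never overlaps the next one, which is what makes a window "tumbling": every trade lands in exactly one candle.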

The concept and the query language of Stream Analytics remind me of the StreamInsight product. There is not much documentation for the service yet, but “Get started ...” has already been written.

HPC

Azure Batch

Azure Batch is a cluster-on-demand service. The service allows you to write a highly scalable application that uses high-performance nodes to run multiple identical tasks.

The use cases for Azure Batch are traditional for the Big Data world: tasks that work with large amounts of data (more than fits into the RAM of a single node). Application areas: genetic engineering, banking, financial exchanges, retail, telecom, healthcare, government agencies, and commercial web services that accumulate large amounts of data in the course of their work.

In addition, Azure Batch suits tasks that need more CPU power than a single node offers (on the CPUs of one node the computation would take too long). Such tasks include rendering, image processing, and video encoding and decoding.

Below is an example of the skeleton you need to implement to run an application in Azure Batch (I give class names with their namespaces so that the origin of each class is obvious).

The code for the server side will look something like this:

    public class ApplicationDefinition
    {
        /// <summary>
        /// Application definition: names the job type and wires up the splitter and the task processor.
        /// </summary>
        public static readonly Microsoft.Azure.Batch.Apps.Cloud.CloudApplication Application =
            new Microsoft.Azure.Batch.Apps.Cloud.ParallelCloudApplication
            {
                ApplicationName = "StockTradingAnalyzer",
                JobType = "StockTradingAnalyzer",
                JobSplitterType = typeof(MyApp.TradesJobSplitter),
                TaskProcessorType = typeof(MyApp.TradesTaskProcessor)
            };
    }

    public class TradesJobSplitter : Microsoft.Azure.Batch.Apps.Cloud.JobSplitter
    {
        /// <summary>
        /// Splits the Job into tasks.
        /// </summary>
        /// <returns>The tasks to be executed.</returns>
        protected override IEnumerable<TaskSpecifier> Split(IJob job, JobSplitSettings settings)
        {
            /* split job here */
        }
    }

    public class TradesTaskProcessor : Microsoft.Azure.Batch.Apps.Cloud.ParallelTaskProcessor
    {
        /// <summary>
        /// Runs a single task.
        /// </summary>
        /// <param name="task">The task to run.</param>
        /// <param name="settings">Task execution settings.</param>
        /// <returns>The result of the task.</returns>
        protected override TaskProcessResult RunExternalTaskProcess(ITask task, TaskExecutionSettings settings)
        {
            /* some magic */
        }

        /// <summary>
        /// Merges the results of the tasks into the result of the Job.
        /// </summary>
        /// <param name="mergeTask">The merge task.</param>
        /// <param name="settings">Task execution settings.</param>
        /// <returns>The result of the Job.</returns>
        protected override JobResult RunExternalMergeProcess(ITask mergeTask, TaskExecutionSettings settings)
        {
            /* yet another magic */
        }
    }

Call on the client:

    Microsoft.WindowsAzure.TokenCloudCredentials token = GetAuthenticationToken();
    string endpoint = ConfigurationManager.AppSettings["BatchAppsServiceUrl"];

    // Create the client
    using (var client = new Microsoft.Azure.Batch.Apps.BatchAppsClient(endpoint, token))
    {
        // Describe and submit the Job
        var jobSubmission = new Microsoft.Azure.Batch.Apps.JobSubmission()
        {
            Name = "StockTradingAnalyzer",
            Type = "StockTradingAnalyzer",
            Parameters = parameters,
            RequiredFiles = userInputFilePaths,
            InstanceCount = userInputFilePaths.Count
        };
        Microsoft.Azure.Batch.Apps.IJob job = await client.Jobs.SubmitAsync(jobSubmission);

        // Monitor the Job
        await MonitorJob(job, outputDirectory);
    }
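
The roles in the server-side skeleton above (a splitter, a task processor, and a merge step) can be tried out locally without any cloud. Here is a minimal Python sketch, assuming a job is just a list of (security, volume) pairs. All function names are hypothetical and have nothing to do with the Batch SDK, and a thread pool stands in for the cluster nodes, so the sketch shows the shape of the computation rather than a real speed-up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

TradeRecord = Tuple[str, int]  # (security, volume)


def split_job(trades: List[TradeRecord], task_count: int) -> List[List[TradeRecord]]:
    """Splitter: cut the input into `task_count` independent chunks."""
    if task_count < 1:
        raise ValueError("task_count must be at least 1")
    return [trades[i::task_count] for i in range(task_count)]


def process_task(chunk: List[TradeRecord]) -> Counter:
    """Task processor: compute partial volume totals for one chunk."""
    totals: Counter = Counter()
    for security, volume in chunk:
        totals[security] += volume
    return totals


def merge_results(partials: List[Counter]) -> Dict[str, int]:
    """Merge step: add up the partial totals of all tasks."""
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts instead of replacing them
    return dict(merged)


def run_job(trades: List[TradeRecord], task_count: int = 4) -> Dict[str, int]:
    """Split, run the identical tasks concurrently, then merge."""
    chunks = split_job(trades, task_count)
    with ThreadPoolExecutor(max_workers=task_count) as pool:
        partials = list(pool.map(process_task, chunks))
    return merge_results(partials)


sample = [("MSFT", 100), ("AAPL", 5), ("MSFT", 50), ("AAPL", 20), ("GOOG", 7)]
totals = run_job(sample, task_count=3)
```

The important property is that `process_task` needs nothing except its own chunk; that independence is what lets the same tasks be spread over as many nodes as you are willing to pay for.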

Subjectively, I would of course like some higher levels of abstraction and exotics such as acyclic execution graphs and something like Distributed LINQ, that is, the way it is done in the Naiad and Dryad projects (in Russian) from Microsoft Research.

A short overview of the service is already available on azure.microsoft.com. Those who want to count words can look at the following tutorial.

Examples with working code (and not the “abstractions” I wrote above) are already available on code.msdn.com (C#) and GitHub (Python).

Azure VM D-series

Speaking of HPC, I'll note the recent announcement of the Azure VM D-series. As always in such announcements, plus nn% was added to what already existed. But sticking to the hard facts: compute nodes with 16 CPU cores, 112 GB of RAM, and an 800 GB SSD became available for about $2.5 per hour (the price as of the end of October 2014).

And more ... (I couldn't come up with a category)

Azure Data Factory

Azure Data Factory is a service that provides tools for orchestrating and transforming data streams and for monitoring data sources such as MS SQL Server, Azure SQL Database, Azure Blobs, and Tables.

The idea is intuitively simple (which counts as a plus): create the service; bind the input and output data sources; create a pipeline (one or more activities that accept input data, manipulate it, and write it to the output stream); and watch. The result will look like this. What simplifies working with the service is that every step except creating the pipeline can be done through the web interface.

Examples of working with the service are already on GitHub, and a “Step by step walkthrough” has been written.
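
The pipeline idea (activities that take input data, transform it, and write it to an output) fits in a few lines of Python. This is a conceptual sketch only: Azure Data Factory pipelines are not defined this way, and the names `run_pipeline`, `drop_incomplete`, and `add_notional` are invented for the illustration.

```python
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]
Activity = Callable[[Iterable[Row]], Iterable[Row]]


def drop_incomplete(rows: Iterable[Row]) -> Iterable[Row]:
    """Activity 1: skip rows that have no price."""
    return (row for row in rows if row.get("price") is not None)


def add_notional(rows: Iterable[Row]) -> Iterable[Row]:
    """Activity 2: add a derived column, price * volume."""
    return ({**row, "notional": row["price"] * row["volume"]} for row in rows)


def run_pipeline(source: Iterable[Row], activities: List[Activity], sink: List[Row]) -> None:
    """Feed the source through every activity in order and write the result to the sink."""
    data = source
    for activity in activities:
        data = activity(data)
    sink.extend(data)


input_rows = [
    {"security": "MSFT", "price": 46.0, "volume": 100},
    {"security": "AAPL", "price": None, "volume": 5},
]
output_rows: List[Row] = []
run_pipeline(input_rows, [drop_incomplete, add_notional], output_rows)
```

In this toy version the list `input_rows` plays the input data source and `output_rows` the output one; in the service those roles belong to the bound storage accounts and databases.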

Machine learning

What I write in this section is not exactly hot news, but it is still nice to repeat.

The most important thing: the first step has been taken, and the public preview of the Azure ML service is available in Azure. Support is claimed for R and more than 400 (maybe not exactly 400, but definitely a lot) pre-installed R packages. There is currently also support for Vowpal Wabbit modules.

In addition, the Azure for Research Award Program has opened for research projects. Its main goal is to make the world better, and its side goal is to popularize the Azure platform in general, and the Azure ML service in particular, among the academic community. I quite like both goals (especially since nobody forbids Microsoft's cloud competitors from doing the same).

Microsoft has also held a number of interesting events on the topic of machine learning.

Events

The train has left (the events are over, but perhaps the recordings remain)

In September, the Microsoft Machine Learning Hackathon 2014 was held.
In late October, the TechEd Europe 2014 conference was held; XaocCPS has already written about it on Habr.
In late October, the Microsoft Azure Machine Learning Jump Start took place.

The train is coming (and you can still jump into it!)

On November 15-16, Moscow will host the Big Data Hackathon, which ahriman has already written about.

Value your data. Explore. Discover!

Source: https://habr.com/ru/post/242403/
