Introduction to HDInsight

HDInsight Service for Windows Azure allows you to work with an Apache Hadoop cluster in the cloud, providing a software environment for managing, analyzing and reporting on Big Data.

I will not dwell on the capabilities of Hadoop at length. First introduced in 2005 as an Apache Software Foundation project, it is a software platform for the distributed processing of large amounts of data; petabyte-scale volumes, for example, are not an obstacle for it.

The platform is based on HDFS (Hadoop Distributed File System), implemented across a Hadoop cluster. The cluster includes nodes that store fragments of files (DataNodes); theoretically, there can be hundreds and thousands of such nodes built on low-cost computing platforms (commodity hardware). To ensure high reliability, redundancy is maintained by replicating fragments between nodes. The NameNode keeps track of which replica resides on which DataNode, so on the client side this looks like an ordinary tree-like file system. The NameNode itself does not perform the basic I/O operations; it only provides the client with metadata about the location of the primary replica of a fragment. Fragment replication is done automatically: if the primary replica of a fragment fails, one of its secondary replicas is promoted to primary, and another copy is automatically created on a secondary node. Scalability on large amounts of data is achieved through parallel processing of the fragments.

Historically, Hadoop grew out of Google Labs work focused on searching and classifying Internet content, and that lineage shows in the platform's second pillar, the MapReduce processing model. The Map function takes a data set as input and converts it into a list of key/value pairs; the Reduce function performs the inverse operation, collapsing the list by grouping it by key. For the purposes of parallelization, multiple instances of these functions can be created, each processing its own fragment. The nodes where the input fragments of the files are stored and where the MapReduce instances processing them are launched are called TaskTrackers, and the node coordinating those instances is the JobTracker. The number of instances is determined by the number and location of the fragments. Besides search engines, many other types of data processing tasks fit this template.

On top of HDFS and MapReduce there are Pig, Hive, Mahout, Pegasus and other projects that provide a higher level of abstraction and let you solve data flow control, query, analytical and data mining tasks (data warehousing), which are traditionally built on database management servers, the relational model, and one or another dialect of SQL. Interaction, no less traditionally, goes through ODBC drivers.
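To make the MapReduce model concrete, here is a minimal sketch of running the classic WordCount example from the Hadoop command window (described later in this article). The jar file name and the input/output paths are illustrative and depend on the installed distribution:

REM Each Map instance emits a (word, 1) pair for every word in its fragment;
REM the Reduce instances then sum the values for each key.
hadoop jar c:\apps\dist\hadoop-1.1.0-SNAPSHOT\hadoop-examples-1.1.0-SNAPSHOT.jar wordcount input output
REM View the aggregated (word, count) pairs
hadoop fs -cat output/part*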
Last fall, at the PASS Summit 2011 conference in Seattle, the Hadoop Connector for Microsoft SQL Server was announced, facilitating data exchange between the two systems. In addition, preview versions of the Windows Azure HDInsight Service and Microsoft HDInsight Server for Windows, built in partnership with Hortonworks and 100% compatible with the open standards of Apache Hadoop, are currently on offer. HDInsight Server for Windows can be downloaded here. To try the HDInsight Service in the cloud, you need to register for the preview here.
As a prerequisite, you must have a Microsoft cloud account; accounts within the MSDN, BizSpark and DreamSpark programs work. Within the preview, you can create a Hadoop cluster of 3 nodes with a total disk space of 1.5 TB. The cluster lives for 5 days from the moment of creation; after that, all configuration and content are lost and the cluster has to be re-created. As initial data you specify a DNS name (which, of course, must be unique) and an administrative login/password. We will not need a Windows Azure SQL Database for storing metadata at first, but just in case, note that such a possibility exists and that the database (should you want to use it) must be created in advance. Click the Request Cluster button at the bottom right of the screen:

[Pic. 1]

A few minutes pass, Cluster Status changes from Deploying to Running, and the cluster can be used.
[Pic. 2]

Click the Go to Cluster link. From the web interface you can open an interactive console for executing JavaScript and Hive commands, start a remote access session, configure ports for interaction via ODBC, create jobs, view the job execution history, and study typical examples of using Hadoop. Using the Downloads button, it is currently possible to install the HiveODBC driver on a local x86 or x64 machine. The Manage Cluster button lets you control the amount of disk space used, as well as specify folders in the Windows Azure BLOB Service, which can be treated as an Azure Storage Vault (ASV), an alternative to cluster disk space for native Hadoop processes, for example as an input and output location for MapReduce. If something gets fatally messed up, the cluster can be re-created by going to www.hadooponazure.com and pressing the Release Cluster button.
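For example, assuming a storage account and container have been registered on the Manage Cluster screen, such Blob folders can be addressed from the Hadoop command line via the asv:// scheme of the preview; the container name below is hypothetical, and the exact path form follows the settings made on that screen:

REM List a folder kept in the Windows Azure BLOB Service instead of HDFS
hadoop fs -ls asv://mycontainer/
REM An asv:// path can equally be passed to a job as MapReduce input or output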

[Pic. 3]

Establish a Remote Desktop connection by clicking the corresponding tile on the portal screen. For authorization, the account specified in Pic. 1 is used.

[Pic. 4]

You can see that the 64-bit edition of Windows Server 2008 R2 Enterprise SP1 is used as the base operating system, installed on the D: partition. To open the Hadoop command window, run Start -> Run ->

D:\Windows\system32\cmd.exe /k pushd "c:\apps\dist\hadoop-1.1.0-SNAPSHOT" && "c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop.cmd"

Create a directory in HDFS for the future experiments, along with a subdirectory for the input data (in Hadoop 1.x, fs -mkdir creates the intermediate directories along the path, much like mkdir -p):

hadoop fs -mkdir Sample1/input

To get online help for the Java FsShell, type hadoop fs -help.
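For reference, here are a few FsShell commands that will come in handy below (standard Hadoop 1.x syntax, shown on the paths used in this walkthrough):

REM List the contents of a directory
hadoop fs -ls Sample1
REM Print a file to the console
hadoop fs -cat Sample1/input/Sample.log
REM Delete a file
hadoop fs -rm Sample1/input/Sample.log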
Transfer the Sample.log file, which will be needed to further illustrate how Hadoop works, into the input subdirectory. This file is an abstract log of a loosely structured format, containing lines marked TRACE, DEBUG, INFO, FATAL, etc. It can be taken from the HortonWorks examples at gettingstarted.hadooponazure.com/hw/sample.log. This is not a terabyte-sized log, it is a modest ~100 KB, but for illustrating, say, MapReduce it will do. For simplicity, let's first download it to a Windows directory on the HDInsight cluster, say d:\Temp; the Windows machine to which the remote connection is established has Internet access. You will immediately be prompted to update Internet Explorer, but for our subsequent tasks this does not matter. Now load Sample.log into HDFS. To copy from the local file system, use the -put switch:

hadoop fs -put d:\Temp\Sample.log Sample1/input/

Make sure it is loaded:

hadoop fs -ls Sample1/input/

[Pic. 5]
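To quickly peek at the uploaded file without copying it back to the local disk, the standard FsShell tail command prints its last kilobyte:

hadoop fs -tail Sample1/input/Sample.log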

Next, the basic features of MapReduce will be discussed using the example of Sample.log analysis.
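As a small preview, the stock Grep job from the same examples jar (its file name, again, is illustrative) can already answer a simple question about Sample.log, for instance how often the marker FATAL occurs:

REM Count occurrences of the regex FATAL across the input fragments
hadoop jar c:\apps\dist\hadoop-1.1.0-SNAPSHOT\hadoop-examples-1.1.0-SNAPSHOT.jar grep Sample1/input Sample1/grep-out FATAL
REM The result is a list of (count, match) pairs
hadoop fs -cat Sample1/grep-out/part*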

Source: https://habr.com/ru/post/165185/

