
Analyzing Twitter data in the cloud using Apache Hadoop and Hive


This guide describes how to request, explore, and analyze Twitter data using the Apache Hadoop-based service for Windows Azure and a Hive query in Excel. Social media is a major source of big data, and publicly accessible APIs such as Twitter's provide useful information that can help you better understand network trends.

The guide consists of the following sections.

  1. Downloading, installing, and using Microsoft Analytics for Twitter
  2. Getting a Twitter feed using cURL and the Twitter Streaming API
  3. Requesting and configuring a new Hadoop on Windows Azure cluster
  4. Processing Twitter data using Hive in Hadoop on a Windows Azure cluster
  5. Configuring the Hive ODBC driver and the Hive panel in Excel to get Hive data

Downloading, installing, and using Microsoft Analytics for Twitter


Microsoft Analytics for Twitter is available for download at the following address: http://www.microsoft.com/download/en/details.aspx?id=26213 .

Excel 2010 and the PowerPivot add-in (available at http://www.microsoft.com/download/en/details.aspx?id=29074 ) are required.

clip_image004[4]

Insert the following movie tags and accounts into the query window:

#moneyball, @MoneyballMovie, @helpmovie, @BridesmaidsSay, @CONTAGION_movie

Press the button

clip_image006[4]

and follow the instructions on the screen.

Note: Hadoop is not used in this section, which demonstrates the Twitter Search API and self-service business intelligence in Excel and PowerPivot.

Getting a Twitter feed using cURL and the Twitter Streaming API


At this stage, you will need the curl.exe application. Download the curl build for your OS (for example, the 64-bit Windows binary with SSL) from http://curl.haxx.se/download.html

clip_image008[4]

and unzip curl.exe into a suitable folder (for example, C:\twitterdata).

Copy the two files get_twitter_stream.cmd and twitter_params.txt from the Step2GetTwitterFeedUsingCURLAndTwitterStreamingAPI folder to the folder containing curl.exe:

clip_image010[4]

Modify the twitter_params.txt file as follows to track tweets:

track=moneyball,MoneyballMovie,helpmovie,BridesmaidsSay,CONTAGION_movie

Modify the get_twitter_stream.cmd script by replacing USER with your Twitter username and PASSWORD with your password on the following line:

curl -d @twitter_params.txt -k stream.twitter.com/1/statuses/filter.json -uUSER:PASSWORD >> twitter_stream_seq.txt

Run the get_twitter_stream.cmd script from the command line as follows:

clip_image012[4]

Streaming tweet data will appear on the screen:

clip_image014[4]

To stop the script, press Ctrl+C. You can then rename the output file and rerun the script.

Requesting and configuring a new Hadoop on Windows Azure cluster


At this stage, you need CTP access to the Apache Hadoop-based service for Windows Azure. Go to https://www.hadooponazure.com/ and follow the invitation link. If you have access, click the Sign in button.

clip_image016[4]

Request a new cluster. Below is an example of a large cluster called mailboxpeak. Enter a username and password, and then click the Request cluster button. For any questions, refer to the “Instructions and Frequently Asked Questions for the Apache Hadoop-based Service for Windows Azure.”

clip_image018[4]

Open FTPS and ODBC ports to access the server.

clip_image020[4]

clip_image022[4]

Click the Interactive Console icon.

clip_image024[4]

Create a directory for the Twitter text file on HDFS using the following JavaScript command:

js> #mkdir /example/data

To upload the sample text files, run the following commands:

js> #put
Source: C:\hadoop\example\data\Sample.txt
Destination: /example/data

To upload large (uncompressed) text files directly to HDFS, you will need curl.exe. If you do not have it, download it as described in step 2 and unpack it into a suitable folder, for example, C:\hadoop\example\data. Then open PowerShell, change to C:\hadoop\example\data, and run the following FTPS PowerShell script to upload the SampleData.txt text file:

C:\hadoop\example\data>

# ----- begin curl ftps to hadoop on azure powershell example -----
# ----- Replace the XXX placeholders with the appropriate servername/username/password

$serverName = "XXX.cloudapp.net"; $userName = "XXXX";
$password = "XXXXXXXX";
$fileToUpload = "SampleData.txt"; $destination = "/example/data/";
# The FTPS endpoint authenticates with an MD5 hash of the password,
# so compute the hash and format the bytes as a hex string.
$Md5Hasher = [System.Security.Cryptography.MD5]::Create();
$hashBytes = $Md5Hasher.ComputeHash($([Char[]]$password))
foreach ($byte in $hashBytes) { $passwordHash += "{0:x2}" -f $byte }
# Build the curl command: -k skips certificate validation and
# --ftp-create-dirs creates the target directory if it does not exist.
$curlCmd = ".\curl -k --ftp-create-dirs -T $fileToUpload -u $userName"
$curlCmd += ":$passwordHash ftps://$serverName" + ":2226$destination"
invoke-expression $curlCmd

# ----- end curl ftps to hadoop on azure powershell example -----

clip_image026[4]

Very large files should be compressed before uploading. A compressed file (with a .gz or similar extension) can be uploaded to a Windows Azure storage account. Using CloudXplorer ( http://clumsyleaf.com/products/cloudxplorer ), upload the file as follows:

After setting up your Windows Azure storage account and installing CloudXplorer, go to the Windows Azure portal and copy the primary access key of your storage account by clicking the View button in the right column.

clip_image028[4]

Then open CloudXplorer and select File -> Manage Accounts. A new dialog box will open. Click New and select Windows Azure account.

clip_image030[4]

In the next dialog box, enter the name of the storage account you created earlier (for example, hadoopdemo) and paste in the copied access key.

clip_image032[4]

Create a container in the new storage account (in Windows Azure, directories are called containers).

clip_image034[4]

Upload (copy) the compressed archive into a container (in our case, the container is called data).

clip_image036[4]

Set up access to Windows Azure blob storage by clicking the Manage Data icon

clip_image038[4]

and then clicking Set up ASV.

clip_image040[4]

You will now need the Windows Azure storage account name (in our case, hadoopdemo) and the primary access key.

clip_image042[4]

Enter the Windows Azure storage account name and primary access key, and then click Save settings.

clip_image044[4]

Processing Twitter data using Hive in Hadoop on a Windows Azure cluster


Go to https://www.hadooponazure.com/ and connect to the Hadoop head node by clicking Remote Desktop.

clip_image046[4]

Click the Open button.

clip_image048[4]

Log in to the remote server with the username and password that you used when creating the cluster in step 3.

Create a directory (for example, c:\Apps\dist\example\data) on the remote Hadoop head node (on the NTFS side) using Explorer or the command line, and then change to it.

Copy the entire contents of the CopyToHeadnode folder to the new directory. This includes HiveUDFs.jar (user-defined functions for Hive queries), gzip, and the text files with the Hive queries. Also copy the All steps to run from the Hadoop Command Shell.txt file to simplify the last part of this step.

clip_image050[4]

RDP supports copying files between the local and remote desktops. Note that Hadoop can sometimes unpack a gzip file while it is being copied to HDFS (see the note below).

Open the Hadoop Command Shell on the remote desktop.

clip_image052[4]

Change directory to c:\Apps\dist\example\data.

clip_image054[4]

Copy the twitter_stream_seq8.txt.gz file from Windows Azure storage to the c:\Apps\dist\example\data folder (on the NTFS side). The location of the file in the storage account depends on the Windows Azure storage association specified in step 3. In our case, the container is called data and the file appears under asv://:

c:\Apps\dist\example\data> hadoop fs -copyToLocal asv://data/twitter_stream_seq8.txt.gz twitter_stream_seq8.txt.gz

Unzip the twitter_stream_seq8.txt.gz archive into the c:\Apps\dist\example\data folder as shown below (you will need gzip.exe, which you can download from http://www.gzip.org/ and place in the directory from which the command is executed):

c:\Apps\dist\example\data> gzip -d -N twitter_stream_seq8.txt.gz

Note: Hadoop can sometimes unpack an archive while copying it to HDFS, but this works only for .bz2 (bzip2) archives ( http://bzip.org/ ):

hadoop fs -copyFromLocal twitter_stream_seq8.txt.gz /example/data/twitter_stream_seq8.txt

Copy twitter_stream_seq8.txt from the c:\Apps\dist\example\data folder to HDFS with the following command:

c:\Apps\dist\example\data> hadoop fs -copyFromLocal twitter_stream_seq8.txt /example/data/twitter_stream_seq8.txt

Make sure the file is now on HDFS. To do this, open

clip_image056[4]

and go to the /example/data folder.

clip_image058[4]

The following steps are contained in the All steps to run from the Hadoop Command Shell.txt file that you copied to the head node.

Create and load the twitter_raw table with the following command:

c:\apps\dist\example\data> hive -v -f load_twitter_raw.txt

clip_image060[4]
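The load_twitter_raw.txt script ships with the sample, and its exact contents are not reproduced in this guide. Purely as a hypothetical sketch, a script of this kind typically creates a single-column table for the raw JSON and loads the uploaded file into it (the table layout and column name here are assumptions):

-- Hypothetical sketch of load_twitter_raw.txt; not the shipped script.
-- Each line of the feed is assumed to be one raw JSON tweet.
CREATE TABLE IF NOT EXISTS twitter_raw (json_response STRING)
STORED AS TEXTFILE;

-- LOAD DATA INPATH moves the file from /example/data into the Hive warehouse.
LOAD DATA INPATH '/example/data/twitter_stream_seq8.txt'
OVERWRITE INTO TABLE twitter_raw;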

The table will be created in the /hive/warehouse directory on the HDFS side:

clip_image062[4]

You can verify this from the Hive command line: run hive from c:\Apps\dist\example and type show tables; at the hive> prompt, as shown below.

clip_image064[4]

To exit Hive, type quit; at the hive> prompt. Create and load the twitter_temp table as follows:

c:\apps\dist\example\data> hive -v -f create_twitter_temp.txt
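The shipped create_twitter_temp.txt is likewise not reproduced here. A hypothetical sketch of the kind of JSON extraction such a script performs is shown below; the column names are assumptions, and the sample's HiveUDFs.jar may supply its own parsing functions instead of the built-in get_json_object:

-- Hypothetical sketch of create_twitter_temp.txt; column names are assumed.
CREATE TABLE twitter_temp AS
SELECT
  get_json_object(json_response, '$.id_str')           AS id,
  get_json_object(json_response, '$.created_at')       AS created_at,
  get_json_object(json_response, '$.user.screen_name') AS screen_name,
  get_json_object(json_response, '$.text')             AS tweet_text
FROM twitter_raw;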

On a 4-node cluster this operation takes more than 20 minutes; on an 8-node cluster it took 8 minutes 55 seconds. Check the progress in the following window:

clip_image066[4]

Click a task to view details and progress. The operation may take more than 20 minutes.

clip_image068[4]

You can also track task execution using the Hadoop Command Shell:

clip_image070[4]

You can again verify the result from the Hive command line with show tables; :

clip_image072[4]

Create and load the twitter_stream table as follows:

c:\apps\dist\example\data> hive -v -f create_twitter_stream.txt

On a 4-node cluster this operation takes more than 60 minutes; on an 8-node cluster it took 31 minutes 54 seconds. Track progress as described above. Create and load the twitter_stream_sample table with the following command:

c:\apps\dist\example\data> hive -v -f create_twitter_stream_sample.txt

Track progress as described above. Create and load the twitter_movies table as follows:

c:\apps\dist\example\data> hive -v -f create_twitter_movies.txt

Track progress as described above. Create the twitter_movies_vw view with the command:

c:\apps\dist\example\data> hive -v -f create_twitter_movies_vw.txt

Track progress as described above.
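The contents of create_twitter_movies_vw.txt are also not shown in this guide. As a purely hypothetical sketch, the view might aggregate the per-tweet movie data into per-movie counts (the source table layout and column names are assumptions):

-- Hypothetical sketch of create_twitter_movies_vw.txt.
-- Assumes twitter_movies holds one row per (movie, tweet) pair.
CREATE VIEW twitter_movies_vw AS
SELECT movie_name, COUNT(*) AS tweet_count
FROM twitter_movies
GROUP BY movie_name;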

Configuring the Hive ODBC driver and the Hive panel in Excel to get Hive data


This section is taken from the “Instructions and Frequently Asked Questions for the Apache Hadoop-based Service for Windows Azure,” which is available from the Downloads tile

clip_image074[4]

in Hadoop on the Windows Azure portal.

clip_image076[4]

clip_image078[4]

From there you can also download HiveODBCSetup for 64-bit and 32-bit versions of Excel.

How to connect the Excel Hive add-in to Hadoop on Windows Azure using HiveODBC


A key feature of Microsoft's big data solution is the integration of Hadoop with Microsoft business intelligence components. A good example is connecting Excel to the Hive data warehouse framework in a Hadoop cluster. This section shows how to connect from Excel using the Hive ODBC driver.

Installing the Hive ODBC driver


To start the installation, download the 64-bit version of the Hive ODBC driver (an MSI file) from the Hadoop on Windows Azure portal. Double-click HiveODBCSetupx64.msi to start the installation. Read the license agreement. If you agree to its terms, click I accept and then Install.

clip_image080[4]

After the installation is complete, click Finish to exit the wizard.

Installing Hive Add-in for Excel


To install this add-in, you need the 64-bit versions of the Hive ODBC driver and Excel 2010. Start the 64-bit version of Excel 2010. The system will prompt you to install the HiveExcel add-in; click Install. Once the add-in is installed, click the Data tab in Microsoft Excel 2010. The Hive panel opens, as shown in the following screenshot:

clip_image082[4]

Creating a Hive ODBC Data Source for Excel


Select Start -> Control Panel to open the Microsoft Windows Control Panel. In the Control Panel window, select System and Security -> Administrative Tools -> Data Sources (ODBC). The ODBC Data Source Administrator dialog box appears.

clip_image084[4]

In the ODBC Data Source Administrator dialog box, select the System DSN tab. Click Add to create a data source. Select the HIVE driver in the list of ODBC drivers.

clip_image086[4]

Click Finish. The ODBC Hive Setup dialog box appears, as shown in the screenshot below.

clip_image088[4]

Enter a name in the Data Source Name field, for example, MyHiveData. In the Host field, enter the host name of the cluster created in the portal, for example, myhadoopcluster.cloudapp.net. Specify the username you use to authenticate on the portal.

Click OK to save the Hive data source. Click OK to close the ODBC Data Source Administrator dialog box.

Getting Hive Data in Excel


Start the 64-bit version of Excel 2010 and go to the Data tab. Click Hive Panel to open the Hive panel in Excel. In the Select or Enter Hive Connection drop-down list, specify the name of the data source created earlier.

The system will ask you to authenticate with the cluster on the portal; enter your username and password. In the Select the Hive Object to Query drop-down list, select hivesampletable [Table]. Check all columns in the table. The Hive Query panel should look something like this:

clip_image090[4]

Click Execute Query .

clip_image092[5]

To query the data from our example, run the following query instead:

select * from twitter_movies_vw limit 20
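If twitter_movies_vw exposes per-movie counts as in the hypothetical sketch earlier, you could also sort the results before limiting them, for example:

select movie_name, tweet_count
from twitter_movies_vw
order by tweet_count desc
limit 20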

Conclusion


In this guide, we looked at how to request, explore, and analyze Twitter data using Hadoop on Windows Azure and a Hive query in Excel.

Source: https://habr.com/ru/post/152313/

