
This guide describes how to request, explore, and analyze Twitter data using the Apache Hadoop-based service for Windows Azure, as well as a Hive query in Excel. Social media is a major source of big data, and publicly accessible APIs for social networks such as Twitter provide useful information that helps you better understand network trends.
This guide consists of the following sections:
- Search, download, install and use Microsoft Analytics for Twitter
- Getting a Twitter feed using cURL and the Twitter Streaming API
- Requesting and configuring a new Hadoop on Windows Azure cluster
- Processing Twitter data with Hive on a Hadoop on Windows Azure cluster
- Configuring Hive ODBC and the Hive panel in Excel to retrieve Hive data
Search, download, install and use Microsoft Analytics for Twitter
Microsoft Analytics for Twitter is available for download at the following address:
http://www.microsoft.com/download/en/details.aspx?id=26213 .
Excel 2010 and the PowerPivot add-in (available at
http://www.microsoft.com/download/en/details.aspx?id=29074 ) are required.
![clip_image004 [4] clip_image004[4]](https://habrastorage.org/getpro/habr/post_images/cb8/632/205/cb8632205c8f71be79613c70d1112e20.jpg)
Insert the following movie tags and accounts into the query window:
#moneyball, @MoneyballMovie, @helpmovie, @BridesmaidsSay, @CONTAGION_movie
Press the button
![clip_image006 [4] clip_image006[4]](https://habrastorage.org/getpro/habr/post_images/189/fb5/7e7/189fb57e743c26e531eda2fc228381ba.jpg)
and follow the instructions on the screen.
Note. Hadoop is not used in this section. It demonstrates the Twitter Search API and self-service business intelligence in Excel and PowerPivot.
Getting a Twitter feed using cURL and the Twitter Streaming API
At this stage, you will need the curl.exe application. Download the curl build for your OS (for example, the SSL binary for 64-bit Windows) from
http://curl.haxx.se/download.html
![clip_image008 [4] clip_image008[4]](https://habrastorage.org/getpro/habr/post_images/08f/78f/811/08f78f811d0a24535f76ec1f198337b4.jpg)
and unzip curl.exe into a suitable folder (for example, C:\twitterdata).
Copy the two files — gettwitterstream.cmd and twitter_params.txt — from the Step2GetTwitterFeedUsingCURLAndTwitterStreamingAPI folder to the folder containing curl.exe:
![clip_image010 [4] clip_image010[4]](https://habrastorage.org/getpro/habr/post_images/5f6/33e/0e0/5f633e0e0035a8a424c86f6a87c48224.jpg)
Modify the twitter_params.txt file as follows to track tweets:

track=moneyball,MoneyballMovie,helpmovie,BridesmaidsSay,CONTAGION_movie

Modify the gettwitterstream.cmd command script by inserting your Twitter username instead of USER and your password instead of PASSWORD on the following line:

curl -d @twitter_params.txt -k stream.twitter.com/1/statuses/filter.json -uUSER:PASSWORD >> twitter_stream_seq.txt

Run the gettwitterstream.cmd script from the command line as follows:
![clip_image012 [4] clip_image012[4]](https://habrastorage.org/getpro/habr/post_images/6b2/1c0/a6b/6b21c0a6ba9f50b27e17fe13fd70aabb.jpg)
Some information will appear on the screen:
![clip_image014 [4] clip_image014[4]](https://habrastorage.org/getpro/habr/post_images/eb5/4b7/4fe/eb54b74fefd3580935a53e8c93e86806.jpg)
To end the script, press Ctrl+C. You can then rename the output file and rerun the script.
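Once you have a capture file, a quick way to see how active each tracked tag is before moving on to Hadoop is a small script. The sketch below is not part of the original walkthrough; it assumes (as the Streaming API's filter.json endpoint behaves) that the capture file contains one JSON tweet per line, possibly with blank keep-alive lines and a truncated final record from Ctrl+C:

```python
import json
from collections import Counter

# The movie tags and accounts tracked in twitter_params.txt.
KEYWORDS = ["moneyball", "MoneyballMovie", "helpmovie",
            "BridesmaidsSay", "CONTAGION_movie"]

def count_keywords(path):
    """Count how many captured tweets mention each tracked keyword."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # the stream emits blank keep-alive lines
            try:
                tweet = json.loads(line)
            except ValueError:
                continue  # skip a record truncated by Ctrl+C
            text = tweet.get("text", "").lower()
            for kw in KEYWORDS:
                if kw.lower() in text:
                    counts[kw] += 1
    return counts
```

For example, `count_keywords("twitter_stream_seq.txt")` returns a `Counter` mapping each keyword to its tweet count for the file produced by the script above.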
Requesting and configuring a new Hadoop on Windows Azure cluster
At this stage, you need CTP access for the Apache Hadoop service in Windows Azure. Go to
https://www.hadooponazure.com/ and click the
invitation link. If you have access, click the
Sign in button.
![clip_image016 [4] clip_image016[4]](https://habrastorage.org/getpro/habr/post_images/03d/444/ae3/03d444ae378a3eaa373d01d287dda787.jpg)
Request a new cluster. Below is an example of requesting a large cluster named mailboxpeak. Enter a username and password, and then click the Request cluster button. For any questions, refer to “Instructions and Frequently Asked Questions for the Apache Hadoop-based Service for Windows Azure.”
![clip_image018 [4] clip_image018[4]](https://habrastorage.org/getpro/habr/post_images/d8d/107/20a/d8d10720a92cecd2dbbcc7dc1e6f1d67.gif)
Open FTPS and ODBC ports to access the server.
![clip_image020 [4] clip_image020[4]](https://habrastorage.org/getpro/habr/post_images/1b4/374/7cc/1b43747cc87f93a8b7b633ad8f556c09.jpg)
![clip_image022 [4] clip_image022[4]](https://habrastorage.org/getpro/habr/post_images/738/206/3f5/7382063f590b61d826458766088ae3f3.jpg)
Click the
Interactive Console icon.
![clip_image024 [4] clip_image024[4]](https://habrastorage.org/getpro/habr/post_images/0bf/d9d/da7/0bfd9dda7b21a54b0c61d483bf621bb8.jpg)
Create a directory for the Twitter text file on HDFS using the following JavaScript command:

js> #mkdir /example/data

To upload trial text files, run the following commands:

js> #put
Source: C:\hadoop\example\data\Sample.txt
Destination: /example/data

To upload large (uncompressed) text files directly to HDFS, you will need curl.exe. If you do not have it, download it according to the instructions in step 2 and unpack it into a suitable folder, for example C:\hadoop\example\data. Then open PowerShell, change to C:\hadoop\example\data, and run the following FTPS PowerShell script to upload the sample data text file (SampleData.txt):
# ----- begin curl ftps to hadoop on azure powershell example -----
# ----- Replace the XXX placeholders with the appropriate servername / username / password -----
$serverName = "XXX.cloudapp.net"; $userName = "XXXX";
$password = "XXXXXXXX";
$fileToUpload = "SampleData.txt"; $destination = "/example/data/";
$Md5Hasher = [System.Security.Cryptography.MD5]::Create();
$hashBytes = $Md5Hasher.ComputeHash($([Char[]] $password))
foreach ($byte in $hashBytes) { $passwordHash += "{0:x2}" -f $byte }
$curlCmd = ".\curl -k --ftp-create-dirs -T $fileToUpload -u $userName"
$curlCmd += ":$passwordHash ftps://$serverName" + ":2226$destination"
invoke-expression $curlCmd
# ----- end curl ftps to hadoop on azure powershell example -----
![clip_image026 [4] clip_image026[4]](https://habrastorage.org/getpro/habr/post_images/ffa/5b3/294/ffa5b3294897f3acc5bae84754112b63.jpg)
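Note that the script authenticates with an MD5 hex digest of the password rather than the password itself. If you want to sanity-check the value the PowerShell loop builds, the following sketch (not part of the original walkthrough) computes the same digest in Python, assuming the password is plain ASCII so the Char[] cast maps one character to one byte:

```python
import hashlib

def ftps_password_hash(password):
    # Mirrors the PowerShell loop: MD5 over the password's characters,
    # each hash byte formatted as lowercase two-digit hex ("{0:x2}").
    return hashlib.md5(password.encode("ascii")).hexdigest()
```

The resulting 32-character string is what gets substituted into the curl command after the `-u $userName:` portion.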
Very large files must be compressed before uploading. A compressed file (with a .gz or similar extension) can be uploaded to a Windows Azure storage account. Using CloudXplorer (
http://clumsyleaf.com/products/cloudxplorer ), upload the file as follows:
After setting up your Windows Azure storage account and installing CloudXplorer, go to the Windows Azure portal and copy the primary access key of your storage account by clicking the
View button in the right column.
![clip_image028 [4] clip_image028[4]](https://habrastorage.org/getpro/habr/post_images/42b/278/1ef/42b2781efdfc26c394fa2d5219cfb978.jpg)
Then open CloudXplorer and select
File -> Manage Accounts . A new dialog box will open. Click
New and select
Windows Azure account .
![clip_image030 [4] clip_image030[4]](https://habrastorage.org/getpro/habr/post_images/49b/4ea/42e/49b4ea42e2ac038e604f61b069f32209.jpg)
In the next dialog box, paste in the name of the storage account you specified when setting up the storage account (for example, hadoopdemo) and the copied access key.
![clip_image032 [4] clip_image032[4]](https://habrastorage.org/getpro/habr/post_images/799/162/626/79916262665a4cc09bc26ad6bb5fa447.jpg)
Create a container in the new storage account (in Windows Azure, the directories are called
containers ).
![clip_image034 [4] clip_image034[4]](https://habrastorage.org/getpro/habr/post_images/9a9/501/480/9a9501480249fe142d87e64ef6c84c6d.jpg)
Upload (copy) the archive into a container (in our case, the container is called
data ).
![clip_image036 [4] clip_image036[4]](https://habrastorage.org/getpro/habr/post_images/40d/6a3/a60/40d6a3a6095bcaab18b917488b70712e.jpg)
Set up a Windows Azure blob storage account by clicking the
Manage Data icon
![clip_image038 [4] clip_image038[4]](https://habrastorage.org/getpro/habr/post_images/07e/bd2/96e/07ebd296e0f1012c0512b566cf7b59cb.jpg)
next to
Set up ASV .
![clip_image040 [4] clip_image040[4]](https://habrastorage.org/getpro/habr/post_images/ab1/df5/a2c/ab1df5a2ccbfd58a8620beb1ac814940.jpg)
Now you need the Windows Azure storage account name (in our case,
hadoopdemo ) and the primary access key.
![clip_image042 [4] clip_image042[4]](https://habrastorage.org/getpro/habr/post_images/53e/2ce/309/53e2ce30999a7ec5d4b0feb9b72be6de.jpg)
Enter the Windows Azure storage account name and primary access key, and then click
Save settings .
![clip_image044 [4] clip_image044[4]](https://habrastorage.org/getpro/habr/post_images/b7f/d37/01c/b7fd3701c8cb9dc04416339ea8dbaecb.jpg)
Processing Twitter data with Hive on a Hadoop on Windows Azure cluster
Go to
https://www.hadooponazure.com/ . Connect to the Hadoop headend by clicking
Remote Desktop .
![clip_image046 [4] clip_image046[4]](https://habrastorage.org/getpro/habr/post_images/ba6/d3c/a1c/ba6d3ca1cbafebcfe2dba5f15c361a42.jpg)
Click the Open button.
![clip_image048 [4] clip_image048[4]](https://habrastorage.org/getpro/habr/post_images/8ea/0d5/1ef/8ea0d51eff8a428e6750cc87b557f5b9.jpg)
Log in to the remote server with the username and password that you used when creating the cluster in step 3.
Create a directory (for example,
c:\Apps\dist\example\data ) on the remote Hadoop head node server (on the NTFS side) using Explorer or the command line, and then go to it.
Copy the entire contents of the CopyToHeadnode folder to the new directory. This includes
HiveUDFs.jar (user-defined functions for the Hive queries),
gzip , and the text files with the Hive queries. Also copy the
All steps to run from the Hadoop Command Shell.txt file to simplify the execution of the last part of this step.
![clip_image050 [4] clip_image050[4]](https://habrastorage.org/getpro/habr/post_images/368/3bd/102/3683bd102ef35867b35dd889bb756b6a.jpg)
RDP supports copying files between the host and the remote desktop. Note that Hadoop sometimes unpacks a gzip file while it is being copied to HDFS.
Open the Hadoop Command Shell on the remote desktop.
![clip_image052 [4] clip_image052[4]](https://habrastorage.org/getpro/habr/post_images/8dc/8bc/8c8/8dc8bc8c8d2dbeeaa53c11ef25451bd6.jpg)
Change directory to
c:\Apps\dist\example\data .
![clip_image054 [4] clip_image054[4]](https://habrastorage.org/getpro/habr/post_images/9f1/e40/f0c/9f1e40f0cb946e363c240bf24270ab84.jpg)
Copy the twitter_stream_seq8.txt.gz file from Windows Azure storage to the c:\Apps\dist\example\data folder (on the NTFS side). The location of the file in the storage account depends on the Windows Azure storage associations specified in step 3. In our case, the container is called data and is shown on the line under asv://:

c:\Apps\dist\example\data> hadoop fs -copyToLocal asv://data/twitter_stream_seq8.txt.gz twitter_stream_seq8.txt.gz

Unzip the twitter_stream_seq8.txt.gz archive into the c:\Apps\dist\example\data folder as shown below (you will need gzip.exe, which you can download from http://www.gzip.org/ and place in the directory from which the command is executed):

c:\Apps\dist\example\data> gzip -d -N twitter_stream_seq8.txt.gz

Note. Sometimes Hadoop unpacks the file when copying to HDFS, but this works only for .bz2 (bzip2, http://bzip.org/ ) archives:

hadoop fs -copyFromLocal twitter_stream_seq8.txt.gz /example/data/twitter_stream_seq8.txt

Copy twitter_stream_seq8.txt from the c:\Apps\dist\example\data folder to HDFS with the following command:

c:\Apps\dist\example\data> hadoop fs -copyFromLocal twitter_stream_seq8.txt /example/data/twitter_stream_seq8.txt

Make sure the file on HDFS has been updated. To do this, open
![clip_image056 [4] clip_image056[4]](https://habrastorage.org/getpro/habr/post_images/ade/388/161/ade38816118c14ca96fb6b392bb2f62d.jpg)
and go to the /example/data folder.
![clip_image058 [4] clip_image058[4]](https://habrastorage.org/getpro/habr/post_images/107/7ab/566/1077ab5664490527f2f49daaf6c7744c.jpg)
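As an aside, if gzip.exe is not at hand, the gzip -d decompression step above can be reproduced with a short Python sketch. This is not part of the original walkthrough; the function works on any .gz file, including the tutorial's twitter_stream_seq8.txt.gz on the head node:

```python
import gzip
import shutil

def gunzip(src_path, dest_path):
    # Equivalent of "gzip -d": stream-decompress src_path into dest_path
    # without loading the whole (potentially very large) file into memory.
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest)
```

For example, `gunzip("twitter_stream_seq8.txt.gz", "twitter_stream_seq8.txt")` produces the uncompressed text file used in the hadoop fs -copyFromLocal step.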
The following steps are contained in the All steps to run from the Hadoop Command Shell.txt file that you copied to the head node.
Create and load twitter_raw with the following command:

c:\apps\dist\example\data> hive -v -f load_twitter_raw.txt
![clip_image060 [4] clip_image060[4]](https://habrastorage.org/getpro/habr/post_images/c45/c41/2d1/c45c412d1acdede1849f76f0f38ca4d8.jpg)
The table will be created in the /hive/warehouse directory on the HDFS side:
![clip_image062 [4] clip_image062[4]](https://habrastorage.org/getpro/habr/post_images/82a/69f/a80/82a69fa80e0dd7db1c7f3c12d4b3bc83.jpg)
You can check this with Hive: run hive from c:\Apps\dist\example and type
hive> show tables; as shown below.
![clip_image064 [4] clip_image064[4]](https://habrastorage.org/getpro/habr/post_images/d6d/f2f/187/d6df2f1871d5dfc17e3bd15c2a9c5122.jpg)
To exit Hive, use the command
hive> quit; . Create and load
twitter_temp as follows:

c:\apps\dist\example\data> hive -v -f create_twitter_temp.txt

With 4 nodes this operation takes more than 20 minutes; with 8 nodes it took 8 minutes 55 seconds. Check the progress in the following window:
![clip_image066 [4] clip_image066[4]](https://habrastorage.org/getpro/habr/post_images/79c/f93/768/79cf93768b0d284612786918009de5ac.jpg)
Click a task to view details and progress. The operation may take more than 20 minutes.
![clip_image068 [4] clip_image068[4]](https://habrastorage.org/getpro/habr/post_images/52d/690/b3c/52d690b3cae2d6ac83707bcaabc34529.jpg)
You can also track task execution using the Hadoop Command Shell:
![clip_image070 [4] clip_image070[4]](https://habrastorage.org/getpro/habr/post_images/09e/913/ecf/09e913ecf8ca0a0c1acf2aab710f34c2.jpg)
You can check this with Hive: run hive from c:\Apps\dist\example and type
hive> show tables; as shown below:
![clip_image072 [4] clip_image072[4]](https://habrastorage.org/getpro/habr/post_images/2c2/01b/683/2c201b683a14c7311b5f538b90a5d65b.jpg)
Create and load
twitter_stream as follows:

c:\apps\dist\example\data> hive -v -f create_twitter_stream.txt

With 4 nodes this operation takes more than 60 minutes; with 8 nodes it took 31 minutes 54 seconds. Track progress as described above. Create and load
twitter_stream_sample with the following command:

c:\apps\dist\example\data> hive -v -f create_twitter_stream_sample.txt

Track progress as described above. Create and load
twitter_movies as follows:

c:\apps\dist\example\data> hive -v -f create_twitter_movies.txt

Track progress as described above. Create and load
twitter_movies_vw with the command:

c:\apps\dist\example\data> hive -v -f create_twitter_movies_vw.txt

Track progress as described above.
Configuring Hive ODBC and the Hive panel in Excel to retrieve Hive data
This section is taken from the
“Instructions and Frequently Asked Questions for an Apache Hadoop Based Service for Windows Azure,” which is on the download tile.
![clip_image074 [4] clip_image074[4]](https://habrastorage.org/getpro/habr/post_images/4b9/112/1ed/4b91121ed06611fa397399fdeb4885e3.jpg)
in Hadoop on the Windows Azure portal.
![clip_image076 [4] clip_image076[4]](https://habrastorage.org/getpro/habr/post_images/bc8/9ae/8bd/bc89ae8bd25251b29fc875856d160923.jpg)
![clip_image078 [4] clip_image078[4]](https://habrastorage.org/getpro/habr/post_images/747/371/96f/74737196f94b43e4e18bfc0fd7fa460b.jpg)
From there you can also download
HiveODBCSetup for 64-bit and 32-bit versions of Excel.
Connecting Excel to Hadoop on Windows Azure with the Hive add-in and HiveODBC
The most important feature of Microsoft's big data solution is the integration of Hadoop with Microsoft's business intelligence components. A good example is connecting Excel to the Hive data warehouse framework of a Hadoop cluster. This section shows how to use Excel via the Hive ODBC driver.
Installing the Hive ODBC driver
To start the installation, download the 64-bit version of the Hive ODBC driver (MSI file) from Hadoop in the Windows Azure portal. Double-click
HiveODBCSetupx64.msi to start the installation. Read the license agreement. If you agree to its terms, click
I accept and then
Install .
![clip_image080 [4] clip_image080[4]](https://habrastorage.org/getpro/habr/post_images/f69/137/b5f/f69137b5fc968b48df81150751459902.jpg)
After the installation is complete, click
Finish to exit the wizard.
Installing Hive Add-in for Excel
To install this add-in, you need the 64-bit versions of both the Hive ODBC driver and Excel 2010. Start the 64-bit version of Excel 2010. The system will offer to install the
HiveExcel extension. Click
Install . When the extension is installed, click the
Data tab in Microsoft Excel 2010. The Hive panel opens, as shown in the following screenshot:
![clip_image082 [4] clip_image082[4]](https://habrastorage.org/getpro/habr/post_images/4a8/028/666/4a802866620be7f5673d9131c6e772aa.jpg)
Creating a Hive ODBC Data Source for Excel
Select
Start ->
Control Panel to start the Microsoft Windows
Control Panel . In the control panel window, select
System and Security ->
Administrative Tools ->
Data Sources (ODBC) . The
ODBC Data Source Administrator dialog box
appears .
![clip_image084 [4] clip_image084[4]](https://habrastorage.org/getpro/habr/post_images/2cc/5f8/4cf/2cc5f84cf1e5ea076a031f27aa52e11f.gif)
In the
ODBC Data Source Administrator dialog box, select the
System DSN tab. Click
Add to create a data source. Select the
HIVE driver in the list of ODBC drivers.
![clip_image086 [4] clip_image086[4]](https://habrastorage.org/getpro/habr/post_images/d86/281/ee3/d86281ee312178a2a2ac2f8a9262ba9f.gif)
Click
Finish . The
ODBC Hive Setup dialog box appears, as shown in the screenshot below.
![clip_image088 [4] clip_image088[4]](https://habrastorage.org/getpro/habr/post_images/2a7/59b/8c0/2a759b8c0faaf30804b6d7b381899b8d.gif)
Enter a name in the
Data Source Name field. For example,
MyHiveData . In the
Host field, enter the name of the cluster node created in the portal. For example,
myhadoopcluster.cloudapp.net . Specify the username for authentication on the portal.
Click
OK to save the Hive data source. Click
OK to close the
ODBC Data Source Administrator dialog box.
Getting Hive Data in Excel
Start the 64-bit version of Excel 2010. Then go to the
Data tab. Click the
Hive Panel to open the Hive panel in Excel. In the
Select or Enter Hive Connection drop-down list, specify the name of the data source created earlier.
The system will ask you to enter a password for authentication in the cluster on the portal. Enter your password and username. In the
Select the Hive Object to Query drop-down list, select
hivesampletable [Table] . Check all columns in the table. The
Hive Query panel should look something like this:
![clip_image090 [4] clip_image090[4]](https://habrastorage.org/getpro/habr/post_images/f63/ed7/c7b/f63ed7c7ba3ad96cc2caf901d53898f6.gif)
Click
Execute Query .
![clip_image092 [5] clip_image092[5]](https://habrastorage.org/getpro/habr/post_images/b62/3d8/d4f/b623d8d4f26855ec93770f4184c658ef.gif)
To process the data from our example, run the following query:

Select * from twitter_movies_vw limit 20
Conclusion
In this tutorial, we looked at how to request, explore, and analyze data from Twitter using Hadoop on Windows Azure and Hive query in Excel.