
The first lab of the Data Engineer program

As the saying goes: it never happened, and here it is again. We thought it over and decided to make the first lab of our new Data Engineer program freely available. For free. No SMS required.



Earlier we wrote about why this profession is worth a closer look. Recently we also interviewed one such specialist, who happens to be one of our instructors.





So here it is. Anyone can work through this lab on their own and get a small taste of what a data engineer does. Everything you need for that is below.



In this lab we will do the following:



  1. Register on a cloud service.
  2. Spin up 4 virtual machines there.
  3. Deploy a cluster on them using Ambari.
  4. Bring up a website on nginx on one of the VMs.
  5. Add a special JavaScript snippet to every page of the site.
  6. Collect the clickstream into HDFS.
  7. Collect it into Kafka.


Lab 1. Bring up your website and organize clickstream collection into Kafka





1. Deploying Virtual Machines



Since the main task of any data engineer is to build pipelines for processing and moving data (and this requires configuring many different tools), every program participant needs a cluster of their own.



After looking at various cloud platforms, we concluded that the best option for us at the moment is the Google Cloud Platform. On registration you get $300 of credit, which can be spent on any services within a year. With careful use, this should be enough for the entire program. In particular, remember to shut the machines down when they are not in use.



After registering, you will be asked to create a new project and name it. Any name will do; feel free to be original.



Go to the Metadata section, then to the SSH Keys tab. Here you can paste your public key, so that you can later log in to any machine with the matching private key. GCP documents in detail how to create such a key from scratch on macOS and Windows; once the key is added, it should appear in the list on this tab.
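If you do not have a key pair yet, a minimal sketch of generating one on Linux or macOS looks like this (the comment at the end typically becomes the username GCP associates with the key; the file name is just an example):

 $ ssh-keygen -t rsa -b 4096 -C ubuntu -f ~/.ssh/gcp_key
 $ cat ~/.ssh/gcp_key.pub    # paste this value into the SSH Keys tab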





Next, you will need to go to the Compute Engine section, and then to the VM Instances subsection, where we will create 4 virtual machines for our cluster.



Machine type: 4 vCPUs, 15 GB memory. Operating system: Ubuntu 16.04, 30 GB boot disk.



Create two such machines in europe-west1-b and two more in europe-west2-b. Unfortunately, GCP has per-region quotas on the number of CPUs, which can only be raised once you move off the free account. You will be able to log in to the machines automatically using the key you added earlier.
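From your laptop, that login looks roughly like this (a sketch: the username comes from the comment in your public key, and the key path and external IP are placeholders for your own values):

 $ ssh -i ~/.ssh/gcp_key ubuntu@<EXTERNAL_IP_OF_THE_VM>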



The next step is to reserve a static IP for your master node. This will eat up roughly $10 of those $300 over the course of the program. Since you will be turning the machines on and off, a static IP will make life easier later on. To do this, go from the Compute Engine section to the VPC Network section, then to the External IP addresses tab. In the list of your servers, find the one you are going to use as the master node and click on Ephemeral; there you can reserve its IP as static.
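If you prefer the command line, the same reservation can also be done with the gcloud CLI (an optional alternative to the console; the address name here is made up, the region must match your master's region, and promoting an in-use ephemeral address requires passing its current IP):

 $ gcloud compute addresses create master-ip --region europe-west1 --addresses <CURRENT_EXTERNAL_IP>
 $ gcloud compute addresses list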



Program participants will need to send this IP to us, and we, in turn, will point our click gun at their website, supplying them with the clicks they need.



2. Installing Hortonworks HDP



A detailed manual on installing HDP via Ambari is laid out here. We decided to keep it in a separate document, because some of you can do this with your eyes half closed.



Important note. Participants will need a large number of different tools to build the full data processing pipeline. For this particular lab, the essential components are enough: HDFS, YARN + MapReduce2, ZooKeeper, and Kafka.



3. Deploy your site



Download the archive with the static site via this link to your server with the static IP. Unzip it to /var/www/dataengineer/ . In principle you could mirror any site; what matters to us is that all program participants work with the same version of the site.
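A minimal sketch of the download-and-unpack step might look like this (the archive URL and file name are placeholders for the link above, and the example assumes a zip archive):

 $ sudo mkdir -p /var/www/dataengineer
 $ wget <ARCHIVE_URL> -O site.zip
 $ sudo unzip site.zip -d /var/www/dataengineer/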



The next step is to install nginx.



Run the following commands:



 $ sudo apt-get install -y python-software-properties software-properties-common
 $ sudo add-apt-repository -y ppa:nginx/stable
 $ sudo apt-get update
 $ sudo apt-get install nginx


For your site to come up and be reachable from a browser, you need to create the following config in /etc/nginx/sites-enabled/default .



 server {
     listen 80 default_server;
     listen [::]:80;
     server_name _;
     root /var/www/dataengineer;

     location / {
         index index.html;
         alias /var/www/dataengineer/skyeng.ru/;
         default_type text/html;
     }

     location /tracking/ {
         proxy_pass http://localhost:8290/tracking/;
     }
 }
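After saving the config, nginx needs to pick it up. A standard way to check the syntax and reload on Ubuntu 16.04 (not spelled out above, but a safe assumption) is:

 $ sudo nginx -t
 $ sudo systemctl reload nginx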


Now type your IP into the browser's address bar; you should land on your copy of the site.



4. Install and configure Divolte



Great: the copy of the site is up and the cluster is deployed. Now we need to organize collecting the clickstream from this site into our cluster. For this task we suggest the Divolte tool, which makes it quite convenient to collect clicks and save them to HDFS or send them to Kafka. We will try both options.



Before installing this tool, we will need to install Java version 8.



Just in case, let's check that it really is not installed yet.



 $ java -version 


If you see something like this, it means that you do not have it:



The program 'java' can be found in the following packages:

  • default-jre
  • gcj-4.8-jre-headless
  • openjdk-7-jre-headless
  • gcj-4.6-jre-headless
  • openjdk-6-jre-headless

    Try: sudo apt-get install <selected package>


To install Java, use the following commands:



 $ sudo apt-get install python-software-properties
 $ sudo add-apt-repository ppa:webupd8team/java
 $ sudo apt-get update
 $ sudo apt-get install oracle-java8-installer


Next, add the path to Java in the environment:



 $ sudo nano /etc/environment 


Insert the line JAVA_HOME="/usr/lib/jvm/java-8-oracle" there and save the file.



Next:



 $ source /etc/environment
 $ echo $JAVA_HOME


The result should be:



/usr/lib/jvm/java-8-oracle

Check again:



 $ java -version 


java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)

Now we can go directly to Divolte .



Grab the current version of the tool from here and download it to your master node.



Then:



 $ tar -xzf divolte-collector-*.tar.gz
 $ cd divolte-collector-*
 $ touch conf/divolte-collector.conf


Go to the conf folder. Rename divolte-env.sh.example to divolte-env.sh and add the following line to it:



HADOOP_CONF_DIR=/usr/hdp/2.6.2.0-205/hadoop/conf



Now it is divolte-collector.conf's turn. Add the following there:



 divolte {
   global {
     hdfs {
       client {
         fs.defaultFS = "hdfs://node1.c.data-engineer-173012.internal:8020"
       }
       // Enable HDFS sinks.
       enabled = true
       // Use multiple threads to write to HDFS.
       threads = 2
     }
   }

   sinks {
     // The name of the sink. (It's referred to by the mapping.)
     hdfs {
       type = hdfs

       // For HDFS sinks we can control how the files are created.
       file_strategy {
         // Create a new file every hour.
         roll_every = 1 hour

         // Perform a hsync call on the HDFS files after every 1000 records are written
         // or every 5 seconds, whichever happens first.
         // Performing a hsync call periodically can prevent data loss in the case of
         // some failure scenarios.
         sync_file_after_records = 1000
         sync_file_after_duration = 5 seconds

         // Files that are being written will be created in a working directory.
         // Once a file is closed, Divolte Collector will move the file to the
         // publish directory. The working and publish directories are allowed
         // to be the same, but this is not recommended.
         working_dir = "/divolte/inflight"
         publish_dir = "/divolte/published"
       }

       // Set the replication factor for created files.
       replication = 3
     }
   }

   sources {
     a_source {
       type = browser
       prefix = /tracking
     }
   }
 }


This config will let you save the clickstream to HDFS. Note that in fs.defaultFS you need to put your own server's FQDN.
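If you are not sure what your node's FQDN is, it can usually be checked directly on the master with a standard Linux command (assuming the hostname is configured the way GCP sets it up):

 $ hostname -f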



To make this work, you need to do two things. The first is to create on HDFS the two folders we specified in the config as working_dir and publish_dir . To do this, switch to the hdfs user:



 $ sudo su hdfs
 $ hdfs dfs -mkdir /divolte
 $ hdfs dfs -mkdir /divolte/inflight
 $ hdfs dfs -mkdir /divolte/published


Change the permissions on the /divolte directory so that other users can write to it:



 $ hdfs dfs -chmod -R 0777 /divolte 


The second thing is to add a script to all pages of your copy of the site. The script looks like this:



 <script type="text/javascript" src="/tracking/divolte.js" defer async></script> 


One way is to use sed . For example, with this command you can add a script to the bottom of the index.html page:



 sed -i 's#</body>#<script type="text/javascript" src="/tracking/divolte.js" defer async></script> \n</body>#g' index.html 


Important! Think about how to apply this to all pages.



Simply using * will not get you far, because the site directory contains subdirectories, and sed will complain about them. Read more about sed here, or come up with your own approach from scratch. One possible direction is sketched below.
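As a hint rather than a ready answer, one possible sketch uses find to walk the site tree and apply the same sed substitution only to HTML files (this assumes the site lives under /var/www/dataengineer and that every page has a closing </body> tag):

 $ find /var/www/dataengineer -type f -name '*.html' -exec \
     sed -i 's#</body>#<script type="text/javascript" src="/tracking/divolte.js" defer async></script>\n</body>#g' {} +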



As soon as you solve this problem, you can run divolte :



 ubuntu@node1:~/divolte-collector-0.6.0$ ./bin/divolte-collector 


You should see something like this:



ubuntu@node1:~/divolte-collector-0.5.0/bin$ ./divolte-collector

2017-07-12 15:12:29.463Z [main] INFO [Version]: HV000001: Hibernate Validator 5.4.1.Final
2017-07-12 15:12:29.701Z [main] INFO [SchemaRegistry]: Using builtin default Avro schema.
2017-07-12 15:12:29.852Z [main] INFO [SchemaRegistry]: Loaded schemas used for mappings: [default]
2017-07-12 15:12:29.854Z [main] INFO [SchemaRegistry]: Inferred schemas used for sinks: [hdfs]
2017-07-12 15:12:30.112Z [main] WARN [NativeCodeLoader]: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-07-12 15:12:31.431Z [main] INFO [Server]: Initialized sinks: [hdfs]
2017-07-12 15:12:31.551Z [main] INFO [Mapping]: Using built in default schema mapping.
2017-07-12 15:12:31.663Z [main] INFO [UserAgentParserAndCache]: Using non-updating (resource module based) user agent parser.
2017-07-12 15:12:32.262Z [main] INFO [UserAgentParserAndCache]: User agent parser data version: 20141024-01
2017-07-12 15:12:37.363Z [main] INFO [Slf4jErrorManager]: 0 error(s), 0 warning(s), 87.60016870518768% typed
2017-07-12 15:12:37.363Z [main] INFO [JavaScriptResource]: Pre-compiled JavaScript source: divolte.js
2017-07-12 15:12:37.452Z [main] INFO [GzippableHttpBody]: Compressed resource: 9828 -> 4401
2017-07-12 15:12:37.592Z [main] INFO [BrowserSource]: Registered source [a_source] script location: /tracking/divolte.js
2017-07-12 15:12:37.592Z [main] INFO [BrowserSource]: Registered source [a_source] event handler: /tracking/csc-event
2017-07-12 15:12:37.592Z [main] INFO [Server]: Initialized sources: [a_source]
2017-07-12 15:12:37.779Z [main] INFO [Server]: Starting server on 0.0.0.0:8290
2017-07-12 15:12:37.867Z [main] INFO [xnio]: XNIO version 3.3.6.Final
2017-07-12 15:12:37.971Z [main] INFO [nio]: XNIO NIO Implementation Version 3.3.6.Final

If instead you get an error about JAVA, log out of the machine and log back in, so that the JAVA_HOME setting from /etc/environment takes effect.



Now open your site and click around, visiting different pages. Return to the terminal and press Ctrl+C. Then check whether anything has appeared in the /divolte/published folder on HDFS. If yes, everything works and you have learned how to collect a clickstream!
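A quick way to check is the same hdfs CLI as above (run it under a user that can read HDFS, for example hdfs):

 $ hdfs dfs -ls /divolte/published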



5. Task



Your task now is to make the clickstream go not to HDFS but to Kafka. A rough starting point is sketched below.
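As a starting point only, not a complete solution: Divolte also supports Kafka sinks configured in the same divolte-collector.conf, roughly along the lines below. The option layout mirrors the hdfs section above, but double-check the names against the Divolte documentation for your version; the broker address and topic name here are assumptions for illustration.

 divolte {
   global {
     kafka {
       // Enable Kafka sinks (assumed to mirror the hdfs section above).
       enabled = true
       producer = {
         // Replace with the FQDN of your Kafka broker(s); 9092 is Kafka's default port.
         bootstrap.servers = "node1.c.data-engineer-173012.internal:9092"
       }
     }
   }
   sinks {
     a_kafka_sink {
       type = kafka
       // Hypothetical topic name; create it in Kafka beforehand.
       topic = divolte
     }
   }
 }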



6. References for study





P.S. Yes, in places this lab reads like a step-by-step instruction. But that is deliberately so at the start of the program. This lab lays the foundation from which everyone will go on to build their own data pipeline, and I want to be sure that everyone gets everything right at the start.

Source: https://habr.com/ru/post/341022/


