As they say, it never happened before, and here it is again: we decided to make the first lab of our new Data Engineer program freely available. For free. No SMS required.
Earlier we wrote about why this profession is worth a look. Recently we also interviewed one of these specialists, who happens to be one of our instructors as well.
So, here it is. Anyone can go through this lab on their own and get a small taste of what a data engineer does. Everything you need for that is provided below.
Here is what we will do in this lab.
Since the main task of any data engineer is to build pipelines for processing and moving data (and this process requires configuring many different tools), every program participant needs their own cluster.
After comparing various cloud platforms, we concluded that the best option for us at the moment is the Google Cloud Platform. On registration you get $300 of credit, which can be spent over a year on any of its services. With careful use this should be enough for the entire program; in particular, remember to turn the machines off when they are not in use.
After registration, you will be asked to create a new project and name it. The name can be anything; feel free to be original.
Go to the Metadata section, then to the SSH Keys tab. Here you can paste the value of your public key, so that later you can log into any machine with the corresponding private key. Here on GCP it is explained in detail how to create a key from scratch on macOS and Windows. In the end, after adding the key, you should have something like this:
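If you have never generated a key pair before, a minimal sketch for Linux/macOS looks roughly like this (the key path is arbitrary; GCP takes the username from the comment at the end of the public key, so we use "ubuntu" here as an assumption):
# Generate a key pair; the comment becomes the login username on GCP.
$ ssh-keygen -t rsa -f ~/.ssh/gcp_key -C ubuntu
# Print the public key and paste it into the SSH Keys tab.
$ cat ~/.ssh/gcp_key.pub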
Next, go to the Compute Engine section, and then to the VM Instances subsection, where we will create 4 virtual machines for our cluster.
Machine type: 4 vCPUs, 15 GB memory. Operating system: Ubuntu 16.04. Boot disk: 30 GB.
Create 2 such machines in the europe-west1-b zone and 2 more in europe-west2-b. Unfortunately, GCP has quotas on the number of CPUs per region, which can only be raised if you are not on a free account. You will be able to log into the machines automatically with the key you added earlier.
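If you prefer the command line to the web console, roughly the same machines can be created with gcloud. This is just a sketch: the instance names are ours, and n1-standard-4 is the machine type corresponding to 4 vCPUs / 15 GB.
$ gcloud compute instances create node1 node2 \
    --zone europe-west1-b --machine-type n1-standard-4 \
    --image-family ubuntu-1604-lts --image-project ubuntu-os-cloud \
    --boot-disk-size 30GB
$ gcloud compute instances create node3 node4 \
    --zone europe-west2-b --machine-type n1-standard-4 \
    --image-family ubuntu-1604-lts --image-project ubuntu-os-cloud \
    --boot-disk-size 30GB
# Remember to stop machines when you are not using them, for example:
$ gcloud compute instances stop node3 node4 --zone europe-west2-b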
The next step is to reserve a static IP for your master node. This will take about $10 of your $300 over the course of the program. Since you will be turning the machines on and off, a static IP will make your life easier later. To do this, go from the Compute Engine section to the VPC Network section, then to the External IP addresses tab. In the list of your servers, find the one you are going to make the master node and click on Ephemeral. There you can reserve its IP as static.
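The same thing can be done from the CLI (a sketch; the address name is ours, and the region must be the one where your master node lives):
# Promote the master's current ephemeral IP to a static address.
$ gcloud compute addresses create master-ip \
    --addresses <EXTERNAL_IP_OF_MASTER> --region europe-west1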
Program participants will send this IP to us, and we, in turn, will point our click "cannon" at their site, providing them with the necessary traffic.
A detailed manual on how to install HDP via Ambari is available here. We put it in a separate document, since some of you can do this with your eyes half closed.
An important note: when building the entire data-processing pipeline, participants will need a large number of different tools. For this particular lab, a minimal set of components is enough: HDFS, YARN + MapReduce2, ZooKeeper, and Kafka.
Download the archive with the static site via this link to your server with the static IP. Unpack it to /var/www/dataengineer/. In principle, any site could be scraped for this; what matters is simply that all program participants work with the same version of the site.
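A minimal sketch of this step, assuming the archive is a gzipped tarball (substitute the actual link from above for the <archive-url> placeholder):
$ wget -O site.tar.gz "<archive-url>"
$ sudo mkdir -p /var/www/dataengineer
$ sudo tar -xzf site.tar.gz -C /var/www/dataengineer/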
The next step is to install nginx.
Run the following commands:
$ sudo apt-get install -y python-software-properties software-properties-common
$ sudo add-apt-repository -y ppa:nginx/stable
$ sudo apt-get update
$ sudo apt-get install nginx
For your site to come up and be reachable from a browser, create the following config at /etc/nginx/sites-enabled/default.
server {
    listen 80 default_server;
    listen [::]:80;
    server_name _;
    root /var/www/dataengineer;

    location / {
        index index.html;
        alias /var/www/dataengineer/skyeng.ru/;
        default_type text/html;
    }

    location /tracking/ {
        proxy_pass http://localhost:8290/tracking/;
    }
}
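After saving the config, check the syntax and restart nginx so that it picks up the changes:
$ sudo nginx -t
$ sudo systemctl restart nginx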
Now type your server's IP into the browser's address bar; your copy of the site should open.
Great: the copy of the site is up and the cluster is deployed. Now we need to organize the collection of the clickstream from this site into our cluster. For this task we suggest using the Divolte tool, which makes it quite convenient to collect clicks and save them to HDFS or send them to Kafka. We will try both options.
Before installing this tool, we will need to install Java version 8.
Just in case, we’ll check that we really don’t have it.
$ java -version
If you see something like this, it means that you do not have it:
The program 'java' can be found in the following packages:
- default-jre
- gcj-4.8-jre-headless
- openjdk-7-jre-headless
- gcj-4.6-jre-headless
- openjdk-6-jre-headless
Try: sudo apt-get install <selected package>
To install Java, use the following commands:
$ sudo apt-get install python-software-properties
$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
Next, add the path to Java in the environment:
$ sudo nano /etc/environment
There you need to insert the following line JAVA_HOME="/usr/lib/jvm/java-8-oracle"
and save the file.
Next:
$ source /etc/environment
$ echo $JAVA_HOME
The result should be:
/usr/lib/jvm/java-8-oracle
Check again:
$ java -version
java version "1.8.0_151"
Java(TM) SE Runtime Environment (build 1.8.0_151-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.151-b12, mixed mode)
Now we can move on to Divolte itself. Take the current version of the tool from here and download it to your master node.
Next:
$ tar -xzf divolte-collector-*.tar.gz
$ cd divolte-collector-*
$ touch conf/divolte-collector.conf
Navigate to the conf folder. Rename divolte-env.sh.example to divolte-env.sh and edit it, adding the following line:
HADOOP_CONF_DIR=/usr/hdp/2.6.2.0-205/hadoop/conf
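The same step as shell commands, run from the Divolte directory (the HDP version in the path should match your installation):
$ cd conf
$ mv divolte-env.sh.example divolte-env.sh
$ echo 'HADOOP_CONF_DIR=/usr/hdp/2.6.2.0-205/hadoop/conf' >> divolte-env.sh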
Now it is the turn of divolte-collector.conf. Add the following to it:
divolte {
  global {
    hdfs {
      client {
        fs.defaultFS = "hdfs://node1.c.data-engineer-173012.internal:8020"
      }
      // Enable HDFS sinks.
      enabled = true
      // Use multiple threads to write to HDFS.
      threads = 2
    }
  }

  sinks {
    // The name of the sink. (It's referred to by the mapping.)
    hdfs {
      type = hdfs
      // For HDFS sinks we can control how the files are created.
      file_strategy {
        // Create a new file every hour.
        roll_every = 1 hour
        // Perform a hsync call on the HDFS files after every 1000 records are written
        // or every 5 seconds, whichever happens first.
        // Performing a hsync call periodically can prevent data loss in the case of
        // some failure scenarios.
        sync_file_after_records = 1000
        sync_file_after_duration = 5 seconds
        // Files that are being written will be created in a working directory.
        // Once a file is closed, Divolte Collector will move the file to the
        // publish directory. The working and publish directories are allowed
        // to be the same, but this is not recommended.
        working_dir = "/divolte/inflight"
        publish_dir = "/divolte/published"
      }
      // Set the replication factor for created files.
      replication = 3
    }
  }

  sources {
    a_source {
      type = browser
      prefix = /tracking
    }
  }
}
This config lets you save the clickstream to HDFS. Note that in fs.defaultFS you need to substitute your own server's FQDN.
To make this work, you need to do two things. The first is to create on HDFS the two directories that we specified in the config as working_dir and publish_dir. To do this, switch to the hdfs user:
$ sudo su hdfs
$ hdfs dfs -mkdir /divolte
$ hdfs dfs -mkdir /divolte/inflight
$ hdfs dfs -mkdir /divolte/published
Change the permissions on the divolte directory so that other users can write to it:
$ hdfs dfs -chmod -R 0777 /divolte
The second thing is to add a script to all pages of your copy of the site. The script looks like this:
<script type="text/javascript" src="/tracking/divolte.js" defer async></script>
One way is to use sed. For example, with this command you can add the script to the bottom of the index.html page:
sed -i 's#</body>#<script type="text/javascript" src="/tracking/divolte.js" defer async></script> \n</body>#g' index.html
Important! Think about how to apply this to all pages. Simply using * will not help much, because the site directory contains subdirectories, and sed will complain about them. Read more about sed here. Or come up with your own approach from scratch.
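If you get stuck, one possible direction (a sketch that partially spoils the exercise, so try your own approach first) is to combine find with the sed command above:
$ find /var/www/dataengineer -name '*.html' -exec sed -i \
    's#</body>#<script type="text/javascript" src="/tracking/divolte.js" defer async></script>\n</body>#g' {} +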
As soon as you solve this problem, you can run divolte:
ubuntu@node1:~/divolte-collector-0.6.0$ ./bin/divolte-collector
You should see something like this:
ubuntu@node1:~/divolte-collector-0.5.0/bin$ ./divolte-collector
2017-07-12 15:12:29.463Z [main] INFO [Version]: HV000001: Hibernate Validator 5.4.1.Final
2017-07-12 15:12:29.701Z [main] INFO [SchemaRegistry]: Using builtin default Avro schema.
2017-07-12 15:12:29.852Z [main] INFO [SchemaRegistry]: Loaded schemas used for mappings: [default]
2017-07-12 15:12:29.854Z [main] INFO [SchemaRegistry]: Inferred schemas used for sinks: [hdfs]
2017-07-12 15:12:30.112Z [main] WARN [NativeCodeLoader]: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-07-12 15:12:31.431Z [main] INFO [Server]: Initialized sinks: [hdfs]
2017-07-12 15:12:31.551Z [main] INFO [Mapping]: Using built in default schema mapping.
2017-07-12 15:12:31.663Z [main] INFO [UserAgentParserAndCache]: Using non-updating (resource module based) user agent parser.
2017-07-12 15:12:32.262Z [main] INFO [UserAgentParserAndCache]: User agent parser data version: 20141024-01
2017-07-12 15:12:37.363Z [main] INFO [Slf4jErrorManager]: 0 error(s), 0 warning(s), 87.60016870518768% typed
2017-07-12 15:12:37.363Z [main] INFO [JavaScriptResource]: Pre-compiled JavaScript source: divolte.js
2017-07-12 15:12:37.452Z [main] INFO [GzippableHttpBody]: Compressed resource: 9828 -> 4401
2017-07-12 15:12:37.592Z [main] INFO [BrowserSource]: Registered source [a_source] script location: /tracking/divolte.js
2017-07-12 15:12:37.592Z [main] INFO [BrowserSource]: Registered source [a_source] event handler: /tracking/csc-event
2017-07-12 15:12:37.592Z [main] INFO [Server]: Initialized sources: [a_source]
2017-07-12 15:12:37.779Z [main] INFO [Server]: Starting server on 0.0.0.0:8290
2017-07-12 15:12:37.867Z [main] INFO [xnio]: XNIO version 3.3.6.Final
2017-07-12 15:12:37.971Z [main] INFO [nio]: XNIO NIO Implementation Version 3.3.6.Final
If not, and there is an error about JAVA, log out of the machine and log back in so that the JAVA_HOME variable from /etc/environment is picked up.
Now go to your site and click around, visiting different pages. Return to the terminal and press Ctrl+C. Then check whether anything has appeared in the /divolte/published directory on HDFS. If it has, everything works, and you have learned how to collect a clickstream!
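A quick way to run that check from the master node:
$ hdfs dfs -ls /divolte/published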
Your task now is to make the clickstream go not to HDFS but to Kafka.
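A hint to get started: Kafka will need a topic to write to. On an HDP cluster, creating one looks roughly like this (the topic name, ZooKeeper address, and replication settings here are our assumptions; the Kafka sink itself is configured in divolte-collector.conf, as described in the Divolte documentation):
$ /usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create \
    --zookeeper node1.c.data-engineer-173012.internal:2181 \
    --replication-factor 1 --partitions 1 --topic divolte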
P.S. Yes, in places this lab reads like a set of step-by-step instructions. But that is exactly what is needed at the start of the program. This lab lays the foundation from which everyone will build their own data pipeline, and we want to be sure that everyone gets everything right at the start.
Source: https://habr.com/ru/post/341022/