Hello.
There are many providers on the Internet that offer cloud services, and with their help you can get hands-on experience with Big Data technology.
In this article, we will install Apache Kafka, Apache Zookeeper and Apache Spark (including spark-shell) on the AWS (Amazon Web Services) EC2 platform and learn how to use it all.
Introduction to the Amazon Web Services Platform
Register at aws.amazon.com/console: enter a user name and remember the password.
Configure the node instances for the Zookeeper and Kafka services.
- Select "Services-> EC2" in the menu. Next, you need to select the operating system version of the image of the virtual machine, choose Ubuntu Server 16.04 LTS (HVM), SSD volume type, click "Select". Go to setting up the server instance: type "t3.medium" with the parameters 2vCPU, 4 GB of memory, General Purpose Click "Next: Configuring Instance Details".
- Set the number of instances to 1 and click "Next: Add Storage".
- Accept the default disk size of 8 GB and change the volume type to Magnetic (in production, size the disk based on the data volume and use a high-performance SSD).
- In the "Tag Instances" section for "Name", enter the instance name of the node "Home1" (where 1 is just a sequence number) and click on "Next: ..."
- In the "Configure Security Groups" section, select the "Use existing security group" option by selecting the name of the security group ("Spark_Kafka_Zoo_Project") and set the incoming traffic rules. Click on "Next: ..."
- On the Review screen, verify the entered values and click "Launch".
- To connect to the cluster nodes, you need a key pair for identification and authorization: create one or, in our case, reuse an existing one by selecting "Use existing pair" from the list.
Creating keys
- Download PuTTY (https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html) as a client, or use an SSH connection from the terminal.
- The .pem key file uses the older format; for convenience, we convert it to the .ppk format used by PuTTY. To do this, run the PuTTYgen utility and load the key in the old .pem format. Convert the key and save it (Save Private Key) with the .ppk extension in the home folder for later use.
Cluster startup
For convenience, rename the cluster nodes using the Node01-04 notation. To connect to the cluster nodes from the local computer via SSH, you need each node's IP address and its public/private DNS name: select each cluster node in turn and write down its public/private DNS name (used for the SSH connection and for installing the software) into a text file, HadoopAdm01.txt.
Example: ec2-35-162-169-76.us-west-2.compute.amazonaws.com
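If you prefer a terminal SSH connection instead of PuTTY, something like the following works (a sketch; the key file name HadoopUser01.pem is an assumption based on the key pair used above):
ssh -i ~/HadoopUser01.pem ubuntu@ec2-35-162-169-76.us-west-2.compute.amazonaws.com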
Installing Apache Kafka in SingleNode Mode on an AWS Cluster Node
To install the software, select our node (copy its Public DNS) and connect to it via SSH. We configure the SSH connection in PuTTY using the saved name of the first node and the private/public key pair "HadoopUser01.ppk" created in the "Creating keys" step. Go to the Connection / SSH / Auth section, use the Browse button and locate the folder where we previously saved the "HadoopUser01.ppk" file.
Save the connection configuration in the settings.
Connect to the node and log in as: ubuntu.
- With root privileges, update the packages and install the additional packages required for further installation and configuration of the cluster.
sudo apt-get update
sudo apt-get -y install wget net-tools netcat tar
- Install the Java 8 JDK and check the Java version.
sudo apt-get -y install openjdk-8-jdk
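The version check mentioned above can then be done with:
java -version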
- For normal cluster node performance, you need to adjust the memory swapping settings. The vm.swappiness parameter is set to 60 by default, which makes the kernel swap data from RAM to disk quite eagerly. Depending on the Linux version, vm.swappiness can be set to 0 or 1:
sudo sysctl vm.swappiness=1
- To preserve the setting across reboots, add a line to the configuration file.
echo 'vm.swappiness=1' | sudo tee --append /etc/sysctl.conf
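If you want to confirm the value after a reboot, read it back (a quick sanity check, not strictly required):
sudo sysctl vm.swappiness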
- Edit the entries in the /etc/hosts file so that the Kafka and Zookeeper cluster node names conveniently resolve to the private IP addresses assigned to the cluster nodes.
echo "172.31.26.162 host01" | sudo tee --append /etc/hosts
Check that name resolution works by pinging any of the entries.
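For example, using the name we just added:
ping -c 3 host01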
- Download the current version of the Kafka distribution (built for the matching Scala version) from http://kafka.apache.org/downloads and prepare the directory with the installation files.
wget http://mirror.linux-ia64.org/apache/kafka/2.1.0/kafka_2.12-2.1.0.tgz
tar -xvzf kafka_2.12-2.1.0.tgz
ln -s kafka_2.12-2.1.0 kafka
- Delete the tgz archive file; we will no longer need it.
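Assuming the archive name from the download step above:
rm kafka_2.12-2.1.0.tgz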
- Let's try to start the Zookeeper service; to do this, run:
~/kafka/bin/zookeeper-server-start.sh -daemon ~/kafka/config/zookeeper.properties
Zookeeper starts with default startup options. You can check the log:
tail -n 5 ~/kafka/logs/zookeeper.out
To ensure the Zookeeper daemon is running after a reboot, we need to start Zookeeper as a background service:
bin/zookeeper-server-start.sh -daemon config/zookeeper.properties
To verify that Zookeeper has started, we check:
netcat -vz localhost 2181
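Another quick check, assuming the bundled Zookeeper version still answers the classic four-letter commands (it should reply with "imok"):
echo ruok | netcat localhost 2181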
Now we set up the Zookeeper and Kafka services properly. First, edit/create the file /etc/systemd/system/zookeeper.service (file contents below).
[Unit]
Description=Apache Zookeeper server
Documentation=http://zookeeper.apache.org
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/home/ubuntu/kafka/bin/zookeeper-server-start.sh /home/ubuntu/kafka/config/zookeeper.properties
ExecStop=/home/ubuntu/kafka/bin/zookeeper-server-stop.sh

[Install]
WantedBy=multi-user.target
Next, for Kafka, we edit/create the /etc/systemd/system/kafka.service file (file contents below).
[Unit]
Description=Apache Kafka server (broker)
Documentation=http://kafka.apache.org/documentation.html
Requires=zookeeper.service

[Service]
Type=simple
ExecStart=/home/ubuntu/kafka/bin/kafka-server-start.sh /home/ubuntu/kafka/config/server.properties
ExecStop=/home/ubuntu/kafka/bin/kafka-server-stop.sh

[Install]
WantedBy=multi-user.target
- Enable systemd scripts for Kafka and Zookeeper services.
sudo systemctl enable zookeeper
sudo systemctl enable kafka
- Check the systemd scripts.
sudo systemctl start zookeeper
sudo systemctl start kafka
sudo systemctl status zookeeper
sudo systemctl status kafka
sudo systemctl stop zookeeper
sudo systemctl stop kafka
- Let's check that the Kafka and Zookeeper services are up (start them again with systemctl if you stopped them in the previous step).
netcat -vz localhost 2181
netcat -vz localhost 9092
- Check the zookeeper log file.
cat logs/zookeeper.out
First joy
Create your first topic on the newly assembled Kafka server.
- It is important to use the "host01:2181" connection string, as specified in the server.properties configuration file.
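An example create command for a single-node setup might look like this (replication factor and partition count are 1 because we have only one broker):
bin/kafka-topics.sh --create --zookeeper host01:2181 --replication-factor 1 --partitions 1 --topic first_topic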
- Let's write some data in the topic.
kafka-console-producer.sh --broker-list host01:9092 --topic first_topic
Ctrl-C exits the topic console.
- Now let's try to read the data from the topic.
kafka-console-consumer.sh --bootstrap-server host01:9092 --topic first_topic --from-beginning
- View the list of Kafka topics.
bin/kafka-topics.sh --zookeeper host01:2181 --list
- Adjust the Kafka settings for the single-node setup: the in-sync replicas (ISR) parameter must be reduced to 1.
bin/kafka-topics.sh --zookeeper host01:2181 --config min.insync.replicas=1 --topic __consumer_offsets --alter
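In the same spirit, for a single-broker installation the following values in config/server.properties are usually lowered to 1 (my assumption about which defaults matter here; adjust to your own configuration):
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
transaction.state.log.min.isr=1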
- Restart the Kafka server and try to connect with the consumer again.
- Look at the list of topics.
bin/kafka-topics.sh --zookeeper host01:2181 --list
Configuring Apache Spark on a single-node cluster
We have prepared a node instance on AWS with the Zookeeper and Kafka services installed; now we need to install Apache Spark. To do this:
Download the latest version of the Apache Spark distribution.
wget https://archive.apache.org/dist/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.6.tgz
- Unpack the distribution, create a symbolic link for spark, and delete the archive file we no longer need.
tar -xvf spark-2.4.0-bin-hadoop2.6.tgz
ln -s spark-2.4.0-bin-hadoop2.6 spark
rm spark*.tgz
- Go to the sbin directory and start the Spark master.
./start-master.sh
- Connect with a web browser to the Spark master UI on port 8080.
- Start a Spark worker (slave) on the same node.
./start-slave.sh spark://host01:7077
- Start spark-shell (from the bin directory) with the master on host01.
./spark-shell --master spark://host01:7077
- If the launch does not work, add the Spark path to your bash profile.
vi ~/.bashrc
# add the following lines:
export SPARK_HOME=/home/ubuntu/spark
export PATH=$SPARK_HOME/bin:$PATH
source ~/.bashrc
- Run the spark shell again with the master on host01.
./spark-shell --master spark://host01:7077
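As an additional end-to-end check, you can submit the bundled SparkPi example to the master (a sketch assuming the standard Spark 2.4.0 layout under ~/spark):
~/spark/bin/run-example --master spark://host01:7077 SparkPi 10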
A single-node cluster with Kafka, Zookeeper and Spark works. Hooray!
A bit of creativity
Download the Scala IDE editor (from scala-ide.org). Launch it and start writing code. I will not go into detail here, since there is already a good article on Habr.
Useful literature and courses:
courses.hadoopinrealworld.com/courses/enrolled/319237
data-flair.training/blogs/kafka-consumer
www.udemy.com/apache-spark-with-scala-hands-on-with-big-data