
Technosphere lectures, semester 2: Methods for distributed processing of large data volumes in Hadoop

We present a new Technosphere lecture course. It is an introduction to Hadoop that focuses on designing and implementing distributed algorithms applicable in various fields: text processing, graphs, relational data, and so on. It also covers the components of the Hadoop platform and its programming models. The goal of the course is to introduce students to the Hadoop technology stack used to store, access, and process large amounts of data. Course instructors: Alexey Romanenko, Mikhail Firulik, Nikolay Anokhin.

Lecture 1. Introduction to Big Data and MapReduce


What "big data" is and the history of the phenomenon. Knowledge and skills needed to work with big data. What Hadoop is and where it is used. What "cloud computing" is; the history of the technology's emergence and development. Web 2.0. Utility computing (computation as a service). Virtualization. Infrastructure as a service (IaaS). Concurrency issues. Managing many workers. Data centers and scalability. Typical big data tasks. MapReduce: what it is, examples. Distributed file systems. Google File System. HDFS as a GFS clone and its architecture.



Lecture 2. Hadoop basics


The history of Hadoop and its applications. Data storage, the Hadoop cluster. System principles. Horizontal scaling instead of vertical. Moving code to the data. Hardware failures. Encapsulating implementation complexity. Comparison with RDBMS. The Hadoop ecosystem. Distributions, vendors, supported operating systems. Recommended reading. Hadoop on the Cloudera VM. Importing and launching the VM. Copying files to HDFS. Running MapReduce jobs in Hadoop. Checking the results.


Lecture 3. The HDFS distributed file system


Tasks for which HDFS is and is not suitable. HDFS daemons. Files and blocks. Block replication. Clients, Namenode, and Datanodes. Reading and writing a file. Namenode: memory usage. Namenode fault tolerance. Access to HDFS, including through a proxy. Shell commands. Copying data, deleting data, and viewing statistics from the shell. The fsck command. HDFS permissions. The DFSAdmin command. Balancer. The FileSystem Java API. File system implementations. The Configuration object. Reading data from a file and writing to it. Glob patterns (globbing).



Lecture 4. MapReduce in Hadoop (introduction)


The MapReduce workflow. Hadoop MapReduce and HDFS. MapReduce execution. The architecture and operation of the first version of MapReduce. The Hadoop API (types, classes). WordCount (configuring the Job, Mapper, Reducer). Reducer as Combiner. Data types in Hadoop. InputSplit, InputFormat, OutputFormat. Shuffle and Sort in Hadoop. Running and debugging jobs. Hadoop Streaming. Streaming in MapReduce.
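As a rough illustration of the WordCount topic, here is a minimal single-process sketch in Python. It is not Hadoop code and not taken from the lecture: the names `mapper`, `reducer`, and `run_wordcount` are hypothetical, and the shuffle step is simulated with an in-memory dictionary.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    for word in line.lower().split():
        yield word, 1


def reducer(word, counts):
    """Reduce step: sum all counts seen for one word."""
    return word, sum(counts)


def run_wordcount(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)  # shuffle: group values by key
    # Sort keys, as the framework would before calling the reducer.
    return dict(reducer(word, counts) for word, counts in sorted(groups.items()))


result = run_wordcount(["to be or not to be", "to do"])
print(result)
```

Because summation is associative and commutative, the same `reducer` function could also run on each mapper's partial output, which is the idea behind using a Reducer as a Combiner.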



Lecture 5. MapReduce in Hadoop (algorithms)


WordCount (baseline, in-mapper combining, mean, distinct values). Cross-correlation (pairs, stripes). MapReduce relational patterns (Selection, Projection, Union, Intersection, Difference, Symmetric Difference, GroupBy and Aggregation, Repartition Join, Replicated Join, TF-IDF).
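The difference between the pairs and stripes approaches to cross-correlation can be sketched in plain Python. This is an illustrative single-process simulation under stated assumptions, not the lecture's code: the function names and the sample `records` are hypothetical, and each record is treated as a set of items that co-occur.

```python
from collections import Counter, defaultdict
from itertools import permutations


def pairs_counts(records):
    """Pairs: one ((a, b), 1) emission per ordered pair; the reducer sums them."""
    counts = Counter()
    for record in records:
        for a, b in permutations(sorted(set(record)), 2):
            counts[(a, b)] += 1
    return counts


def stripes_counts(records):
    """Stripes: one (a, {b: n, ...}) emission per item; the reducer merges stripes."""
    stripes = defaultdict(Counter)
    for record in records:
        items = set(record)
        for a in items:
            stripe = Counter(b for b in items if b != a)
            stripes[a].update(stripe)  # element-wise sum of stripes
    return stripes


records = [["a", "b", "c"], ["a", "b"], ["b", "c"]]
pairs = pairs_counts(records)
stripes = stripes_counts(records)
print(pairs[("a", "b")], dict(stripes["b"]))
```

Both variants produce the same co-occurrence counts. Pairs emits many small key-value records, while stripes emits fewer but larger ones, so the trade-off is between shuffle volume and per-record memory.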



Lecture 6. MapReduce in Hadoop (graphs)


Graph as a data structure. Tasks and problems on graphs. Graphs and MapReduce. Adjacency matrix. Adjacency lists. Shortest-path search. Dijkstra's algorithm. Parallel BFS: algorithm, pseudocode, iterations, termination criterion, comparison with Dijkstra. Weighted BFS: edges, termination criterion, complexity. Graphs and MapReduce. PageRank: what it is and where it is applied. Computing PageRank and its simplifications. PageRank on MapReduce. Full PageRank, convergence. Other classes of graph problems. The main issues for graph algorithms. Improved partitioning. Schimmy design pattern.
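To make the parallel BFS topic concrete, here is a small Python sketch of the iterative scheme for unweighted graphs. It is a single-process simulation under assumptions, not the lecture's pseudocode: the names `bfs_iteration` and `parallel_bfs` are hypothetical, the graph is an adjacency list in which every node appears as a key, and the loop stops when an iteration changes no distance.

```python
INF = float("inf")


def bfs_iteration(graph, dist):
    """One MapReduce-style round over an adjacency-list graph."""
    # Map: each node re-emits its own distance, and every reached node
    # proposes distance + 1 to each of its neighbors.
    proposals = {node: [d] for node, d in dist.items()}
    for node, neighbors in graph.items():
        if dist[node] < INF:
            for neighbor in neighbors:
                proposals[neighbor].append(dist[node] + 1)
    # Reduce: keep the smallest proposed distance per node.
    return {node: min(candidates) for node, candidates in proposals.items()}


def parallel_bfs(graph, source):
    """Repeat rounds until no distance changes (the termination criterion)."""
    dist = {node: INF for node in graph}
    dist[source] = 0
    while True:
        new_dist = bfs_iteration(graph, dist)
        if new_dist == dist:
            return dist
        dist = new_dist


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"], "e": []}
distances = parallel_bfs(graph, "a")
print(distances)
```

Unlike Dijkstra's algorithm, which expands one node at a time from a global priority queue, each round here advances the whole frontier at once, so the number of rounds grows with the longest shortest path from the source.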



Lecture 7. Introduction to Pig and Hive


What Pig is and what it is used for. Pig and MapReduce. Key features. Components. Execution modes. Running Pig. Pig Latin. DUMP and STORE operations. Large amounts of data. The LOAD command. Data types for the schema. Pig Latin (diagnostic tools, grouping, inner and outer bags, FOREACH, the TOKENIZE function, FLATTEN, WordCount, inner and outer joins). Hive (architecture, interface, concepts, table creation, data loading, query execution, inner and outer joins, WordCount).



Lecture 8. NoSQL, HBase, Cassandra


Scaling up. Scaling RDBMS (master/slave, sharding). What NoSQL is. Dynamo and BigTable. The CAP theorem. Consistency models. Eventual consistency. NoSQL types. Key/value. Schema-less. What HBase is, when it is needed, and when not to use it. The HBase data model. Column Family. Timestamp. Cells. Architecture and components of HBase. Key distribution across RegionServers. Data storage in HBase. Master and ZooKeeper. Access to HBase. Column Family as a unit of storage. Querying data from HBase. What Cassandra is. A typical NoSQL API. The Cassandra data model and consistency.



Lecture 9. ZooKeeper


What ZooKeeper is and its place in the Hadoop ecosystem. Fallacies of distributed computing. Diagram of a typical distributed system. The complexity of coordinating distributed systems. Typical coordination problems. Principles behind the design of ZooKeeper. The ZooKeeper data model. Znode flags. Sessions. Client API. Primitives (configuration, group membership, simple locks, leader election, locking without the herd effect). ZooKeeper architecture. ZooKeeper DB. Zab. Request handler.



Lecture 10. Apache Mahout


What Apache Mahout is. Implemented algorithms. Classification (Naive Bayes, k-Means). Recommendations (collaborative filtering, item-based, Slope One algorithm, Apache.teste, item-based with Hadoop, Mahout with Spark, co-occurrence recommenders).



Lecture 11. The Pregel computational model


Web 2.0 and social graphs. Examples of graphs. Graph processing tasks. Tools for processing large graphs. Pregel (what it is, concept). Vertex. The compute method. Combiner. Aggregator. Modifying the graph. Giraph (architecture, program execution, fault tolerance). PageRank. Shortest paths. Performance.



Lecture 12. Spark


Motivation. RDD. The Spark programming model. Higher-order functions. RDD transformations (Map, Reduce, Join, CoGroup, Union, and Sample). RDD actions. SparkContext. Creating an RDD. Shared variables (broadcast, accumulator). The Apache Spark engine. The Spark programming interface. Lineage. Dependencies between RDDs (narrow, wide). Task scheduling. RDD fault tolerance. Memory management. Applications that are and are not suitable for RDDs.



Lecture 13. YARN


What YARN is and what it is for. YARN and the old MapReduce. MapReduce components on YARN. Running an MR job on YARN. Launching MapReduce jobs. Job initialization. MRAppMaster initialization. MRAppMaster and Uber Job. Task assignment. Memory management (creating containers for running tasks, controlling memory for each task, JVM heap, virtual memory). Task execution. Status updates. The Resource Manager web interface. Task failures. Application Master failures. Node Manager failures. Resource Manager failures. Task scheduling.



Lecture 14. Hadoop in Mail.Ru Search


Introduction: history, Search components. Why Hadoop? Why HBase? Search robot (Old school, New generation H). What do we store in HBase? Working with Hadoop. Translation difficulties. Operation (Ganglia). Useful lessons from Hadoop and HBase.



Subscribe to the Technopark and Technosphere YouTube channel!

Source: https://habr.com/ru/post/258045/

