In early November, JavaDay 2017, one of the key Java conferences in Eastern Europe, will be held in Kiev for the sixth time. Although there is still plenty of time before the event, we spoke in detail with one of the conference speakers, Konstantin Budnik, Chief BigData Technologist and Open Source Fellow at EPAM Systems, about the strength of open source, Big Data, and the future of Hadoop.

You worked at Sun Microsystems for almost 15 years, then worked on Hadoop for a long time. How did your career develop?
I joined Sun in 1994, when the St. Petersburg office opened - I was employee number 6. I spent 15 years there and worked on everything from compilers to distributed and cluster systems: the operating system, different parts of the Java stack - in particular, on the JVM team, where I participated in the development of the virtual machine and created several frameworks for JVM developers.
As for open source, I have been working with open technologies since about 1994. In particular, in the mid-2000s, I helped bring Java into the then recently created Linux Foundation effort: Java became part of LSB, the Linux Standard Base. Around the same time, I worked a little with Ian Murdock, the creator of the Debian Linux distribution.
Starting in the 2000s, I moved into distributed systems, and during my time at Sun I received about 15 US patents in the field of distributed computing and related technologies.
In 2009, I moved to Yahoo and started working on Apache Hadoop — writing code that went into the Apache project. In my first 3 years on the Hadoop project, I wrote more code than all of IBM had at that time. More specifically, I worked on HDFS — the distributed file system for storing data in Hadoop — and on a distributed fault-injection system. Some time passed after that, and for a year now I have been working with EPAM.
What are you working on there?
EPAM has a Big Data Practice unit. Today it has about 300 experts in processing "big data". Some of them work in data science, and others build architectures for processing large volumes of data. I am the Chief Technologist of this unit, and in addition I lead open-source software development for the entire company. EPAM actively encourages its engineers to support the open-source projects, systems, and platforms that it uses to solve product development tasks.
It is noteworthy that the share of open-source software in Big Data is about 94-95%. The amount of "closed", commercially developed software used to solve "big data" processing problems is very small.
Why is that?
Businesses increasingly understand that commercial software vendors can - and do - hold them hostage to their decisions. Investing in one particular technology costs the entire technology stack its flexibility. Trying to solve the problem by bringing in additional technologies from other vendors leads to incompatible interfaces, data storage formats, implementation languages, and other "symptoms".
In addition, by becoming dependent on a single software vendor, a company turns it into an "internal monopolist." The vendor raises license prices or arbitrarily changes the product's capabilities without consulting the client. Getting rid of such vendor lock-in requires new investments: in replacing hardware and software and in retraining or rotating staff.
Using open-source solutions avoids such problems. If you are running Linux with Red Hat and that no longer suits you, you can move to the free CentOS, Fedora, or Debian, or buy Canonical support for Ubuntu. From a business point of view, no major changes will occur. When it comes to handling business-critical data volumes, avoiding vendor lock-in becomes a must.
You have been in the open-source community for over twenty years. During this time, it has changed the world thanks to Linux and Android, and open-source technologies have become the basis for product development at large companies. What is the power that moves an open community forward? How should large companies build their interaction with such communities?
The idea is quite simple: people are interested in working on what they are interested in working on. The principle is ingenious in its simplicity. Open source lets people express themselves even if their day job is boring. A person comes to such a community, finds like-minded people, and gets the opportunity to do something unique - for example, to steer the entire project in the direction he considers right. This is a very powerful incentive for creative, technically minded people.
Look at Linus Torvalds: the man really changed the world because he is smart, persistent, and productive, and carries people along by his example and competence. In this, open source compares favorably with some commercial development: what is valued here is what you do and how you do it, not how much and how beautifully you talk and promise. This principle is called meritocracy, and it is a great alternative to traditional governance structures. It is actively used by GitHub and a number of other companies.
If you look at the fundamental difference between commercial development and open source, the former is always driven by what the customer will buy. Business must be profitable, and unless you are a politician or the CEO of Tesla, you can make money only by offering people what they really need. At the same time, the engineer most often "does not see" the client and does not understand his needs - technical marketing, sales, and other departments handle that. And the developer often spends a long time working on a very narrow slice of the product.
Open projects have no such restrictions. The developers themselves are the first users of their systems, which are built in an atmosphere of open collaboration, without boundaries between development teams.
Of course, competition remains at the business level, and there are different strategies for commercializing open-source projects. There are three main commercial players on the Hadoop platform market: Hortonworks (spun out of the Hadoop development team at Yahoo!), Cloudera, and the much smaller MapR. Everyone else who tried to build their own distributions (IBM, Intel) eventually either joined forces with one of the first two or uses Apache Bigtop to build their own platforms from the canonical code of the Apache projects (Google DataProc and Amazon EMR).
But they sell these platforms to their customers in different ways. Cloudera adds commercially closed components, such as its cluster management system, on top of the open source. MapR, for example, added its own file system, painfully reminiscent of NFS. Hortonworks, on the other hand, releases absolutely everything in the open - all their new development goes to Apache. This is how they attract clients, demonstrating that everything is open and the client can take the code and continue developing his solution independently.
There is a good expression: "There is no boss in open source." The only requirement is that the community accepts your work. There are certain standards to meet: code quality, adherence to certain engineering principles. As soon as a developer meets those requirements, he can work on any piece of the project and no one will stop him. At the same time, no one forces you to stick to that particular piece for the rest of your life. This gives people a chance to grow professionally, improve their skills, and build up their technology base.
As the open-source fellow at EPAM, can you tell us what percentage of the company's developers commit to open-source projects?
It is rather hard to estimate the percentage - after all, we have more than 20 thousand developers. But over the past six months we have launched an interesting initiative within which we purposefully contribute to Apache Ignite, Apache Flink, and Apache Zeppelin. To put it in perspective: of the roughly 100 contributors to the latest version of Apache Flink, 10 collaborate with EPAM. The company has about 20-30 of its own open-source projects - from test report management and web projects to human genome analysis and Big Data processing in the cloud. We are building up expertise in strategically interesting areas of Big Data processing to develop products and platforms for our customers. You can view and participate in the projects at github.com/epam.
Going back to big data and Hadoop: you said you are looking at which technologies are interesting to customers. What does that mean, and how will Hadoop continue to evolve? After all, this is a fairly mature technology.
I may not say what you would expect from someone who has been developing components of the Hadoop ecosystem for many years: Hadoop has stabilized. The technology has grown up and matured. Large companies have come to recognize it and use it for internal infrastructure solutions. But the active development cycle of any software is not infinite. For a while there is a surge of development and improvements, and then everything settles down. The phase when this stability arrives is called the commodity phase: the technology becomes common and accessible to all. Hadoop has entered exactly this phase.
By the way, it is funny that even 11 years after Hadoop appeared, for many people it remains a mantra: "install it and every problem will be solved." In fact, sober calculation is needed - many problems are solved without distributed computing or massive cluster storage platforms. For example, the entire StackOverflow website lives on 4 servers with SSD disks: a master for StackOverflow itself, a master for everything else, and two replicas. And they serve 200 million requests per day. How many businesses in the world face similar volumes?
Once a technology becomes a commodity, qualitative improvements stop appearing in it. Hadoop originally consisted of a file system for storing data and a compute framework for processing it - MapReduce. The latter is little used today because of its inefficiency; Tez and Spark appeared in its place. The file system survived, but nothing revolutionary is happening there either. HDFS was built to create large data stores in data centers. Companies like Amazon, Facebook, Yahoo!, Google, and mobile operators store a lot of data, but not everyone uses HDFS. Google Spanner, for example, is a well-established, globally distributed system. And more and more non-infrastructure companies are leaving their data centers for the clouds of Amazon, Microsoft, or Google, while those who stay in their own data centers often run OpenStack with the Ceph file system.
The base layer that was originally Apache Hadoop is slowly beginning to dissolve. New components that initially ran on top of it are starting to move to the cloud. The whole stack is still called Hadoop, although it now includes more than 30 components, of which only two are the original Apache Hadoop. Hadoop has ceased to be the center of the universe. Apache Spark did not become one either - it simply repeated the Hadoop story at a slightly different level.
Now there is an active migration away from Hadoop itself toward the cloud - and the segment where Hadoop is used to store data is shrinking. If you really need to, you can spin up a Spark cluster on AWS and not worry about HDFS. Most businesses and developers focus on processing the data; business has little interest in how it is stored. The speed of data exchange with a cloud file system is lower than with HDFS - much lower, even, if you are not afraid of Microsoft solutions and go to Azure. But there is no need to spend money on your own specialists and system administrators, or to buy hardware that becomes obsolete faster than it breaks.
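To make that concrete, here is a minimal sketch of "Spark without HDFS": a plain Java Spark job reading directly from S3. The bucket, path, and column names are hypothetical, and the s3a:// scheme assumes the hadoop-aws connector and AWS credentials are configured.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class EventsOnS3 {
    public static void main(String[] args) {
        // The job is written exactly as it would be against HDFS;
        // only the URI scheme of the input path changes.
        SparkSession spark = SparkSession.builder()
                .appName("events-on-s3")
                .getOrCreate();

        // Hypothetical bucket and layout; reading over s3a:// requires the
        // hadoop-aws connector and AWS credentials to be available.
        Dataset<Row> events = spark.read().parquet("s3a://example-datalake/events/2017/");

        long total = events.filter("country = 'UA'").count();
        System.out.println("Matching events: " + total);

        spark.stop();
    }
}
```

Against HDFS the job would look identical except for that one path string - which is exactly why storage has stopped being the interesting part.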
The simple and effective solution wins.
For example, Amazon has put a lot of work into its self-service model. Such a system does not require specialists to launch and operate it - anyone can open the console, "press a button", bring up a cluster, load the data, and process it. The skill set also becomes a commodity. A company does not need its own system administrators to look after hundreds of machines when Amazon has 5 administrators serving 100 thousand servers. To manage resources in the cloud and handle software releases, you need DevOps.
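By way of illustration - a sketch only, not a recipe - this is roughly what "pressing the button" looks like when done from code with the AWS SDK for Java (v1 EMR client). The cluster name, instance types, EMR release label, and IAM role names are assumptions made for the example.

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class SelfServeCluster {
    public static void main(String[] args) {
        // Credentials and region come from the default provider chain.
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

        RunJobFlowRequest request = new RunJobFlowRequest()
                .withName("ad-hoc-analytics")                   // hypothetical cluster name
                .withReleaseLabel("emr-5.8.0")                   // an EMR release current in 2017
                .withApplications(new Application().withName("Spark"))
                .withServiceRole("EMR_DefaultRole")              // assumed default IAM roles
                .withJobFlowRole("EMR_EC2_DefaultRole")
                .withInstances(new JobFlowInstancesConfig()
                        .withInstanceCount(3)
                        .withMasterInstanceType("m4.large")
                        .withSlaveInstanceType("m4.large")
                        .withKeepJobFlowAliveWhenNoSteps(true));

        // One API call replaces provisioning and wiring the machines by hand.
        RunJobFlowResult result = emr.runJobFlow(request);
        System.out.println("Cluster started: " + result.getJobFlowId());
    }
}
```

The same thing can be done from the web console or the CLI; the point is that no dedicated administrator is involved in standing up the cluster.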
Computing power is also becoming a given - and people focus on what value can be derived from the data: selling products in stores better, or optimizing a marketing offer by predicting customer behavior. The technological side of data processing is actively moving toward predictive analytics - modeling the behavior of, say, buyers: we process historical data and build a model that tells us what might happen. The second relevant direction is prescriptive analytics, where we process the data and conclude that to sell 10% more we need to change our marketing strategy in a certain way.
It is impossible to predict or model the future with 100% accuracy. But if such approaches raise the confidence of the result from 25% to 60%, that is a victory, albeit a small one.
One interesting application of this approach is in industries that operate many complex machines - building analytics that predict when and how they will break down. Oil and gas production is one example. Imagine an offshore oil platform - it consists of a million different components. If some bearing in a pump breaks, the whole platform stops for a week while the part is ordered, manufactured, and delivered by helicopter from the mainland. Instead, by analyzing information from the sensors, we can spot abnormal signs in advance, such as an increase in pump temperature or vibration.
Using machine learning methods on historical data from the equipment, we can predict with a confidence of, say, 94% that this bearing will fail in the next 4 days. Having ordered and received the bearing in advance, we can plan to suspend production for just 30 minutes, ultimately saving millions.
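A minimal sketch of how such a failure model could be trained with Spark MLlib in Java. The storage paths, column names, and the choice of logistic regression are assumptions for illustration, not a description of any particular production pipeline.

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BearingFailureModel {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("bearing-failure-model")
                .getOrCreate();

        // Hypothetical historical readings: one row per pump per hour, with sensor
        // values and a 0/1 label marking whether the bearing failed within 4 days.
        Dataset<Row> history = spark.read().parquet("s3a://example-sensors/pump-history/");

        // Collect the raw sensor columns into a single feature vector.
        VectorAssembler features = new VectorAssembler()
                .setInputCols(new String[]{"temperature", "vibration", "rpm"})
                .setOutputCol("features");

        LogisticRegression lr = new LogisticRegression()
                .setLabelCol("failed_within_4_days")
                .setFeaturesCol("features");

        PipelineModel model = new Pipeline()
                .setStages(new PipelineStage[]{features, lr})
                .fit(history);

        // Score the latest readings: the probability column is the model's
        // confidence that a given pump is heading toward a failure.
        Dataset<Row> latest = spark.read().parquet("s3a://example-sensors/pump-latest/");
        model.transform(latest).select("pump_id", "probability").show();

        spark.stop();
    }
}
```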
All of these technologies are becoming increasingly available to non-programmers and increasingly aimed at the typical user. Because the volume of data keeps growing, processing speed must grow too - otherwise the whole game loses its meaning. That is why we increasingly see in-memory computing platforms that operate entirely in RAM. Among them are Apache Ignite, originally developed by GridGain, and Apache Geode, which came from Pivotal.
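For readers who have not touched these platforms, here is a minimal Apache Ignite sketch in Java: a single node with the default configuration and a simple key-value cache. The cache name and values are placeholders.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class IgniteQuickStart {
    public static void main(String[] args) {
        // Starts a node with the default configuration. In a real deployment every
        // node runs the same code and the cache is partitioned across their RAM.
        try (Ignite ignite = Ignition.start()) {
            IgniteCache<Integer, String> sessions = ignite.getOrCreateCache("sessions");

            // The data lives in memory across the cluster, not on disk.
            sessions.put(42, "user-profile-blob");
            System.out.println(sessions.get(42));
        }
    }
}
```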
Distributed in-memory data processing is receiving more and more attention from large technology companies. A surprising fact: computer memory has fallen in price by a factor of almost 100,000 over the past 20 years - in nominal dollars, not even counting inflation of almost 60% over the same period. It seems to me that this is one of the areas where many interesting things will happen in the coming years - we will talk about this and other trends at JavaDay 2017.