Your personal course on Big Data

Hi, Habr!

After I published several articles on Big Data and Machine Learning, I received many letters from readers with questions. Over the past few months I have been able to help many people make a quick start; some of them are already solving applied problems and making progress, and some have already found a job and are working on real problems. My goal is to be surrounded by smart people with whom I can also work in the future. That is why I want to help those who really want to learn how to solve real problems in practice. The web is full of guides on how to become a data scientist (Data Scientist). At one time I went through everything that is out there. In practice, however, other knowledge is sometimes needed. In today's article I will tell you which skills you need, and I will try to answer all your questions.

If you google How to become a Data Scientist, you will stumble upon a lot of pictures like this one or this one. On the whole, what is written there is true. But having studied all of it does not guarantee that you will succeed in solving real problems in practice. You can follow the path outlined in the images above, namely learn on your own and then go and solve real problems. Or you can do otherwise and get a specialized education. At one time I had the opportunity to go both ways: Coursera courses, the School of Data Analysis, and many other courses at the university, including computer vision, web graph analysis, Large Scale Machine Learning, and so on. I was lucky to learn from the best teachers and to take the best courses available. But only after I began to apply this knowledge in practice did I realize that courses often do not pay due attention to practical problems, or that the material does not sink in until you run into those problems yourself. Therefore, I will try to lay out a minimum set of skills that will be enough to start solving problems in practice as soon as possible.

Become a great mathematician

Yes, this is probably the most important thing: mathematical thinking, which you need to keep developing in yourself from an early age. For those who may have missed it, it is worth starting with courses in Discrete Mathematics, which is useful for everyone who works in IT. All the proofs and reasoning in later courses build on it. I recommend taking the course by Alexander Borisovich Dainik, whose lectures I once attended in person. That should be enough. What matters here is gaining skill in working with discrete objects.
After you learn how to work with discrete objects, I recommend getting acquainted with the construction of efficient algorithms. For this it is enough to take a small course on algorithms, such as the School of Data Analysis course, or to read the overview of well-known algorithms on e-maxx.ru, a site that is quite popular among ACM participants. It is enough to understand how to implement algorithms efficiently and to know the typical data structures and when to use them.

Once your brain has learned to work with discrete objects and your algorithmic thinking has developed, you need to learn to think in terms of probability theory. For this I recommend (refreshing your knowledge of discrete mathematics along the way) taking the course by my supervisor, Andrei Mikhailovich Raigorodsky, who knows how to explain complex things in simple terms. What matters here is learning to reason in terms of probability theory and knowing the basic concepts of mathematical statistics.

Strictly speaking, this is not everything, but in practice it is enough to be able to handle discrete objects and work with probabilistic quantities. It also helps to have some idea of linear algebra, but machine learning courses usually include introductions to the sections you need. Add good programming skills to this, and you can become a good developer.

Learn to write code

To become a good developer you of course need to know programming languages and have experience writing good production code. For a data scientist, knowing a scripting language is usually enough; things like templates, classes, or exception handling are, as a rule, not needed, so you should not go deep into them. Instead, it is good to know at least one scripting language oriented toward scientific and statistical computing. The most popular are Python and R. There are many good online courses on both, for example this one for Python or this one for R; they provide basic knowledge that is sufficient for a data specialist. Above all, it is important to learn data manipulation: this is 80% of a data scientist's work.
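To give a feel for what "data manipulation" means here, below is a minimal sketch that uses only the Python standard library. The CSV content, the column names and the function `mean_amount_by_city` are made up for illustration; in real work the data would come from a file, and a library such as pandas would probably do the same job in fewer lines.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data; in practice this would be read from a file.
RAW = """user,city,amount
alice,Moscow,120.5
bob,Kazan,
carol,Moscow,80
dave,Kazan,40.0
"""


def mean_amount_by_city(text):
    """Parse CSV text, skip rows with a missing amount, average per city."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        if not row["amount"]:  # missing value: skip the row
            continue
        totals[row["city"]] += float(row["amount"])
        counts[row["city"]] += 1
    return {city: totals[city] / counts[city] for city in totals}


result = mean_amount_by_city(RAW)
print(result)
```

Even this toy case shows the typical steps: parsing, handling missing values, grouping and aggregating.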

Take basic machine learning courses

Once you have a good mathematical culture and programming skills, it is time to start studying machine learning. I highly recommend starting with Andrew Ng's course, because it is still the best introduction to the subject. Of course, the course skips some important common algorithms, such as trees, but in practice the theoretical knowledge gained in it will be enough for you to solve most problems. After that, I strongly recommend starting to solve problems on Kaggle as soon as possible, beginning with the tasks from the Knowledge section: they come with good tutorials that walk through the tasks and are aimed at a quick start for beginners. After that, you can learn more about the remaining areas of machine learning by completing K. V. Vorontsov's machine learning course. What matters here is getting a holistic view of the tasks that may arise in practice and of the methods for solving them, and learning to implement your ideas in practice. It is also worth adding that most machine learning algorithms are already implemented in libraries, such as scikit-learn for Python. I published an introduction to Scikit-Learn earlier.

Practice building algorithms

Take part in as many machine learning competitions as you can: solve both simple classical problems and non-classical ones where, for example, there is no training set. You need this to pick up the variety of techniques and tricks that are used in such tasks and that help to significantly improve the quality of the algorithms. I talked about some practically important tricks earlier here and here.

After that, you are usually ready to build good algorithms and to take part in Kaggle competitions with cash prizes. So far, however, your abilities are limited to small data that fits in the RAM of your machine. To be able to work with big data, you need to get acquainted with the Map-Reduce computation model and the tools used for working with big data.

Get acquainted with big data

After you have learned how to build good models, you need to learn how to work with big data. First of all, get acquainted with how large data is stored, namely the HDFS file system, which is part of the Hadoop stack, and with the Map-Reduce computation model. After that, familiarize yourself with the other components of the Hadoop stack: how YARN works, how the Oozie scheduler works, how NoSQL databases such as Cassandra and HBase work, and how data is imported into a cluster using Apache Flume and Apache Sqoop. There are still few online courses on these topics; the most complete reference is the book Hadoop: The Definitive Guide. What matters here is understanding how all the Hadoop components interact, as well as the methods of storing and computing on big data.
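The Map-Reduce model mentioned above can be sketched in a few lines of plain Python. This is only a single-process illustration of the three phases (map, shuffle, reduce) on the classic word-count task; it does not use Hadoop, and the function names are chosen for this example. On a real cluster the framework runs mappers and reducers on different machines and performs the shuffle over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases one after another in a single process."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

The point of the model is that the mapper and the reducer each see only a small piece of the data at a time, which is what lets the computation be spread over many machines.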

Get to know modern tools

After studying the Hadoop technology stack, you need to familiarize yourself with the frameworks that use the Map-Reduce paradigm and with other tools used for big data computing. I described some of these tools earlier. Namely, get acquainted with Apache Spark, which has recently been gaining popularity and which we have already looked at here, here and here. In addition, I recommend getting acquainted with alternative tools that you can work with even without a cluster, such as Vowpal Wabbit, which we reviewed earlier: it builds linear models by training them online, without loading the training set into RAM. It is also important to learn the simple tools from the Hadoop stack, Hive and Pig, which are used for simple data operations in a cluster. What matters here is learning to implement the machine learning algorithms you need, as you did before using Python. The difference is that you are now working with big data using a different model of computation.
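The idea of training a linear model online, without keeping the training set in RAM, can be illustrated with a small sketch in plain Python. This is not Vowpal Wabbit's algorithm or API, just plain stochastic gradient descent with a squared loss; the synthetic data generator, the learning rate and all names are assumptions made for the example.

```python
import random


def stream_examples(n, seed=0):
    """Yield (features, target) pairs one at a time, as if read from disk.

    Synthetic data for illustration: target = 2*x1 - 3*x2 + 1, no noise.
    """
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, 2 * x[0] - 3 * x[1] + 1


def train_online(examples, n_features, lr=0.1):
    """Fit a linear model with one gradient step per example."""
    w = [0.0] * n_features
    b = 0.0
    for x, y in examples:
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        err = pred - y
        # Gradient step; only the current example is held in memory.
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err
    return w, b


weights, bias = train_online(stream_examples(20000), n_features=2)
print(weights, bias)
```

Because the examples come from a generator, memory use does not grow with the size of the training set; the same loop would work over a file that is far larger than RAM.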

Explore Real-Time Big Data Processing Tools and Architecture Issues

Often you want to build systems that make decisions in real time. Unlike working with accumulated data, this area has its own terminology and computation model. I recommend getting acquainted with Apache Storm, which assumes that the unit of information being processed is a transaction, and with Apache Spark Streaming, which is built around the idea of processing data in small chunks (batches). After that, any reader will ask what a cluster architecture looks like in which part of the incoming data is processed online and part is accumulated for later processing, how these two components interact with each other, and which tools are used at each stage of data storage and processing. For this, I recommend getting acquainted with the so-called lambda architecture, which is described in detail on this resource. What matters here is understanding what happens to the data at each stage: how it is transformed, how it is stored, and how computations are performed on it.
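To make the two ideas above more concrete, here is a toy sketch in plain Python: a helper that cuts a stream into small batches, and a counter split into a batch layer and a real-time layer whose results are merged at query time. It uses neither Storm nor Spark Streaming; the class `LambdaCounter`, the page-view events and the instantaneous "batch job" are simplifying assumptions for illustration only.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, size):
    """Group a (possibly endless) event stream into lists of at most `size`."""
    it = iter(events)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


class LambdaCounter:
    """Toy page-view counter split into a batch layer and a speed layer."""

    def __init__(self):
        self.master = []             # all events accumulated so far
        self.batch_view = Counter()  # precomputed over the accumulated data
        self.speed_view = Counter()  # covers events since the last batch run

    def ingest(self, batch):
        """Store a small batch and update the real-time view incrementally."""
        self.master.extend(batch)
        self.speed_view.update(batch)

    def run_batch_job(self):
        """Recompute the batch view from scratch and reset the speed view."""
        self.batch_view = Counter(self.master)
        self.speed_view = Counter()

    def query(self, key):
        """Answer a query by merging both views."""
        return self.batch_view[key] + self.speed_view[key]


counter = LambdaCounter()
stream = ["home", "cart", "home", "pay", "home"]
batches = list(micro_batches(stream, size=2))
for b in batches[:2]:
    counter.ingest(b)
counter.run_batch_job()
counter.ingest(batches[2])
print(batches, counter.query("home"))
```

In a real system the batch recomputation takes a long time and runs on a cluster, which is exactly why a separate fast layer is needed to cover the most recent data.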

So, we have covered far from all the knowledge and skills required to understand how to work with Big Data in practice. Real problems often bring many difficulties of their own. For example, there may simply be no training set, or part of the data may be known only with limited accuracy. When it comes to really huge data sets, technical difficulties often begin, and it is important to know not only the methods of machine learning but also how to implement them efficiently. Moreover, the tools that let you process data in RAM are only just emerging and developing: you often have to try hard to cache data properly, or deal with the well-known small-files problem in that same Apache Spark. You have to deal with all of this in practice!

Write me your questions

I repeat that in publishing articles on Habr I pursue the goal of preparing people for work in Big Data so that I can work with them later. Over the past few months I have been able to help many people make a quick start. So I really want to get to know you and answer your current questions, help you start solving problems, or help with the ones you already have. After that I will follow your progress (if you do not mind) and help where necessary. I will choose the best people and personally mentor them over the next few months, after which I may have interesting offers for them!

I don't know how many emails will arrive; I will just say that I plan to answer late in the evening or at night, because I work during the day. I will try to answer as many emails as I can.

Besides educating people, I also want to show that the Big Data processing methods that marketers are so fond of talking about are not a “magic wand” that works wonders. I will try to show which tasks are solved well today, which can be solved if desired, and which are still hard to solve. After your questions, I will write a big post with detailed answers. Let's develop Data Science together, because there are really not enough real specialists, and there are more than enough expensive courses.

So, everyone who would like to learn how to solve problems, regardless of your level of training: send me an email with the subject Big Data to al.krot.kav@gmail.com, including:

Information about yourself: your name, what you do, where you work or study
Your experience: what you have tried to learn on your own, what worked and what did not
The goals you want to achieve: this is the most important point, without it I will not read the email
Your immediate question, if you already have one

I look forward to your emails!

Source: https://habr.com/ru/post/252743/
