
“The Little Engine That Could!”, or the “Machine Learning and Data Analysis” specialization through the eyes of a Data Science beginner

In my previous article on learning Data Science from scratch, I promised to sign up for the Machine Learning and Data Analysis specialization on Coursera and share my impressions of how accessible this material is to an almost complete newcomer to data science. No sooner said than done! Habr already has posts about this and similar specializations, of course, but I think my “five kopecks” won't hurt.



The quote in the title and the picture were not chosen by chance: at times this specialization caused me almost physical pain, and I badly wanted to quit everything, but curiosity eventually won. So if you are wondering how I got through this series of courses at the lowest possible cost, read on below the cut.







Part 1. "Total Recall" - a little about skills



I think it makes sense to start by recalling how it all began, so that readers can try my experience on for themselves.


This article is the last in a spontaneously formed series about how I learned the basics of Data Science from scratch (the articles are listed below in order of appearance):





I started each of those articles with a brief description of my skills. Since working through all of the above materials took about one week in total (not counting the time spent writing the articles), I cannot say I had progressed very far, so by the time I started on Coursera my background was as follows:





This was the base from which I started. The specialization description honestly says: “Intermediate Specialization. Some related experience required.” I admit that alarmed me, but since MIPT and Yandex are among the developers of the specialization, I decided to take a chance.



By the way, the course really did make me recall everything: at particularly difficult moments, knowledge that had once gone in one ear and out the other, and seemed long forgotten as useless, suddenly surfaced in my memory. Then again, it seems to me that in statistics and calculus this specialization left more in my head than the corresponding subjects of my undergraduate and master's programmes combined.



Well, I won't torment you any longer; let's get to the point.



Part 2 - "Inception" - getting to know the course



“What is the most resilient parasite? Bacteria? A virus? An intestinal worm? An idea. Resilient, highly contagious. Once an idea has taken hold of the brain, it's almost impossible to eradicate. An idea that is fully formed, fully understood, that sticks.” - Inception



While writing this article and recalling how the courses went, I thought that the only sensible explanation for my interest in Data Science is that someone planted the idea in me in a dream, or even in a dream within a dream within a dream...



And it was not just the idea of completing a Data Science course on Coursera; it was the idea of completing it as quickly as possible, because I didn't have much money, and I didn't have six months to spare either.



In case you are not familiar with Coursera's new policy: this specialization now works on a subscription basis, with a 7-day free trial followed by a monthly fee.

The specialization is designed to take about 6 months. One month cost me 4,576 rubles (it costs a little more now).



So the system gave me 1 month + 1 week, and I decided that this was the window in which I had to complete the specialization. Looking ahead, I can say that the task is quite feasible.
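A quick back-of-the-envelope check of what this plan saves, using the price quoted above. This is only a rough sketch: the variable names are made up for illustration, and it assumes the monthly price stays constant for the whole period, which is already not quite true.

```python
# Rough cost comparison; a constant monthly price is an assumption.
MONTHLY_PRICE_RUB = 4576   # price of one month quoted in the article
NOMINAL_MONTHS = 6         # duration the specialization is designed for
PAID_MONTHS_FAST = 1       # one paid month on top of the 7-day free trial

nominal_cost = MONTHLY_PRICE_RUB * NOMINAL_MONTHS
fast_cost = MONTHLY_PRICE_RUB * PAID_MONTHS_FAST
savings = nominal_cost - fast_cost

print("Nominal pace:", nominal_cost, "rubles")
print("Fast pace:", fast_cost, "rubles")
print("Savings:", savings, "rubles")
```

In other words, the fast route costs one sixth of the nominal one, at the price of the sleepless nights described further on.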



Let's move on to the specialization's program. It consists of 6 courses: five are theoretical, and the sixth is a course project (Capstone Project), which opens only after you pass the first five. The courses are best taken in order; nobody forces you to, but it is strongly recommended. If you decide to complete the specialization quickly, it sometimes makes sense to take part of a course out of order (more on that later), but most likely this will come back to bite you, and you will have to return to material you have already passed.



The five courses lead you smoothly to the point where you can apply the knowledge on your own. They are most valuable taken together, but in principle each course can also be useful separately. Some courses (or rather parts of them) do seem somewhat detached from the main context, but an overall line can still be traced, and the demands on your skill level gradually rise from course to course.



You start with the basics of Python, calculus and probability theory, then move on to supervised and unsupervised learning (from basic scikit-learn models to neural networks), then statistics, then practical applications. This seems to be a common approach to teaching Data Science.



It may be critical for some that the course is built around Python 2. To be on the safe side, I would not even recommend importing things from __future__, because in some assignments the grader is very picky, and problems can arise from, for example, differences between library versions, including when you use Python 3 (at least judging by the forum posts).
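To make the version issue a bit more concrete, here is a small sketch of two well-known differences between Python 2 and Python 3 that can quietly change a numeric answer. It is written for Python 3, with comments describing what Python 2 would do. Whether a particular course grader trips on exactly these points is an assumption, not something I verified.

```python
# Written for Python 3; comments note what Python 2 would do instead.

# 1. Division of two integers.
true_quotient = 5 / 2     # 2.5 in Python 3; Python 2 would give 2
floor_quotient = 5 // 2   # 2 in both versions

# 2. zip() returns a lazy iterator in Python 3 and a list in Python 2.
#    Wrapping it in list() gives the same result in both versions.
pairs = list(zip([1, 2], ["a", "b"]))

print(true_quotient, floor_quotient, pairs)
```

An answer computed with the first kind of division can differ between interpreters even though the code looks identical, which is exactly the sort of thing an automatic checker will not forgive.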



In my opinion, the most convenient setup is Anaconda. If you already have Anaconda installed with a Python 3 environment, don't be discouraged: a second, Python 2 environment is easy to set up (I installed conda using this manual). It installs on both Windows and Linux; I have not tried it on Mac OS X, but I expect it installs there without problems too.



By the way, judging by the specialization forums, many people took the courses on Windows. I recommend installing Linux as a second system just in case; it is certainly not necessary, but it may come in handy.



I installed Linux Mint as a second system purely for this specialization and did not regret it. Subjectively, computations seemed faster under it in places, and I had fewer problems installing some of the libraries required during the courses.



The first course looks quite friendly to a beginner: the guys from MIPT and Yandex, each charismatic in his own way, explain why all of this is needed and do not scare you with ferocious assignments at first. After that, the level of frustration depends on your preparation. I, like some other people on the forum, had days when I could not find the solution to an assignment or a test all day long; on the other hand, if you have the aptitude and a good foundation, I think everything will be simple and clear.



Each course has a session (about a month) in which you are meant to study it. A course consists of weeks, a week usually consists of 2-4 lessons, and each lesson usually has optional material (lectures, practice quizzes) and graded material (graded quizzes, programming assignments, peer-reviewed assignments, and so on). You must pass the graded material to complete the course.



If you miss a deadline, you are not penalized, but difficulties may arise where other people are involved, for example with peer-reviewed assignments (everyone else will have moved ahead, and there will be nobody left to check your work). If you do not fit into one session, you can always switch to the next one, and your results will be kept.



A separate word about the lecturers and assignments. A large team worked on the specialization, which has both advantages and disadvantages. Most of the material is taught by 4 key lecturers, each with their own area of expertise. It is clear that they are experienced and intelligent people, but some of them take getting used to. I won't name names, so as not to offend anyone. I will only note that there are lecturers at whose feet you want to bow, because they try to spell the material out even for beginners, while others can be a little nervous and at first provoke a burning desire to commit violent acts of an aggressive nature. That is of course my subjective reaction, caused by weak basic knowledge and personal perception. In any case, by the end of the specialization you get used to each lecturer's manner and even feel a little sorry to part with them.



The lecturers are also directly tied to the graded material of their lessons. You will notice that some lecturers' assignments are (on the whole) gentler, for example quizzes as simple as “three kopecks”, while the assignments on statistics and probability theory make you sweat a fair bit.

Programming assignments (and/or peer-reviewed ones) were also developed by different people, so in some cases the wording can cause complete incomprehension and hopeless panic in an unprepared person.



As for the guest lecturers, they are competent people, and even if you dislike something about one of them, they are not on screen long enough to get tiresome; there are not many such moments anyway. As a rule, guest lecturers present material that is somewhat detached from the main context but certainly useful for general development.



I don't want to go into the details of each course; you will work it all out as you study, so let me move on to useful tips. And once again: there are articles on Habr about this specialization, for example from MIPT itself.



Part 3 “The Hitchhiker's Guide to the Galaxy” - how to avoid excruciating pain



“The Galaxy is a harsh place. To survive in it, you need to know where your towel is.” - The Hitchhiker's Guide to the Galaxy



Below I will try to describe a few things that cost me sleepless nights and hair shed onto the keyboard. I hope this saves you some trouble; let it be your “towel”.



1. My big mistake was the lack of a structured approach to recording what I learned. In some respects I am rather old-fashioned and do not use popular good practices. Around the 4th course I realized that I should have started something like a mind map (or any equivalent) from the very beginning. The main problem starts when the course stops leading you by the hand and requires you to go back to earlier material and dig out the implementation of a function or a piece of theory covered before. Do not rely on memory; it will probably fail you somewhere. Thankfully, there are ways to compensate for the lack of a mind map, but I still recommend structuring what you learn in some way.



2. Despite the main message of this article, I do not recommend galloping through the specialization as I did. Yes, 6 months may objectively be a lot, but I think three months gives quite comfortable conditions for absorbing the knowledge at a measured pace. Taking the specialization in a month plus one week means, besides sleepless nights and no normal weekends, that your brain will probably not digest what you have learned. For example, I discovered a funny effect: when I was already on the 4th course, I would suddenly, while doing something completely different, begin to understand points from the first courses. And by the time I completed the final project, an understanding of the very basics of statistics from the 4th course suddenly formed in my head. Apparently the brain needs time. To partly compensate for the lack of time when studying fast, I recommend starting to read a textbook on the subject in parallel a couple of courses in. I chose A. Müller and S. Guido, “Introduction to Machine Learning with Python: A Guide for Data Scientists” (2017). There is little theory in it, but the book closely mirrors the techniques covered in the courses.

Another option is “Python Machine Learning” by Sebastian Raschka (suggested by Metsur).



3. Use the course forums and Slack; you will be surprised how many people face the same problems as you. Since I was in a hurry, I started almost every assignment by reading the forum threads about the difficulties people had with it. It is not uncommon to find outright chunks of code on the forum, or the answer format the grader expects, and in especially hard cases plain-language explanations from users who have worked out what the author of the assignment (who apparently has trouble communicating with non-specialists) wanted from you. Slack saved me at the very last stage, when I had to cooperate with people on peer review. There are few people in the sixth course, so to avoid a long wait it is useful to find people who have already passed this stage and ask them for a review, or, conversely, to help those catching up with you with advice (within the rules), so that they reach you sooner and can review your work. Another small life hack: if you are not being given enough classmates' assignments to review and do not want to wait, you can always search the forum for links from people asking for a review (even from a couple of months ago). Out of solidarity, though, I advise you not to stop at the required minimum of three reviewed works but to help more people. Besides the forum, in the first courses it helps simply to search the Internet, where you can easily find hints for your assignments (for example, one assignment in the first courses is based on a scientific article that can be found, along with pieces of code to peek at), but later on the Internet becomes less useful.



4. One point may become a stumbling block for some. For assignments checked by the grader, I recommend generating the answer files from Python with file-writing functions rather than by hand in Notepad; this will save you from “invisible” characters that make the system reject a correct answer.



5. Enrolling in sessions. Track your progress carefully. If you want to finish quickly, you cannot afford to wait idly. Some assignments cannot be submitted until a session begins. For example, if you finished the 2nd course on the 14th and the session for the third course only starts on the 21st, you will not be able to submit some assignments (usually the peer-reviewed ones) for 7 days. So it makes sense to enroll in the next session a little before you finish the previous course.



Here is an example. Suppose a course has already started, but its first 3 weeks contain no peer-reviewed assignments; then it makes sense to enroll in that session and catch up, rather than wait for a new session to start and for your classmates to reach the third week. A second example: I had to enroll in one of the courses ahead of schedule. I finished the second course and immediately enrolled in the fifth, quickly submitted the peer-reviewed assignment in its very first week, and calmly returned to the third and fourth courses in order. That way I did not miss the moment when people were ready to review my work, and I caught up afterwards. The downside: I later had to relearn the first week of the fifth course, because everything had flown out of my head.



6. Not everyone knows this, and it does not seem to be stated explicitly anywhere, so just in case: at least for now, Coursera gives you six months for the Capstone project. My subscription (one month + the free week) expired on 08/08/17, but, as support told me, access to the Capstone project continues for six months from the start of the 6th course; in my case that is until the end of January, because I started at the end of July. Knowing this can save you some nerves.



7. The Capstone project is divided into 4 branches; completing one of them is enough to finish the specialization, and in places the grading is not very fair. For example, in the 5th assignment of the 1st project (identification of Internet users) it is very hard to get a high mark, because you need to reach the top of a Kaggle competition. In the 5th assignment of the sentiment analysis project, on the other hand, you are asked to write a primitive website parser; it can be done in half an hour without even looking at the previous assignments, and a good grade is easier to get (the best score is the one that counts). So you can sharpen your skills by doing assignments from other branches in addition to your main one, combining business with pleasure.



8. Do not be lazy about writing clean code and laying out your notebook well. I was in a hurry (and, well, lacked the knowledge), so my code was dreadful: I can hardly make sense of it myself, and it is hard for other people too (which sometimes affects peer grades). I think there is no shame in looking at how others do it and improving a little, without crossing the line into plagiarism. I also recommend including the text of the assignment descriptions in the notebook: I no longer remember what exactly I did in each cell, and access to the assignments closes when the subscription ends, so I cannot look it up. In reality, though, many of the assignments are on GitHub, so this is not very critical.



9. It may be trite, but assess your strength realistically. Seriously: over that month I sometimes slept 2 hours a night, did not see friends or relatives, forgot to eat, and spent entire weekends solving assignments, among much else. So if you really want to master the specialization in a month without special preparation, consider whether you are ready for that.
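Tip 4 above (writing grader answers from Python rather than by hand) can be sketched roughly as follows. The helper name, the file name and the space-separated answer format are all hypothetical; each real assignment states its own expected format, so treat this only as an illustration of the idea.

```python
import os
import tempfile

def write_answer(path, values):
    """Write values separated by single spaces, with no trailing newline."""
    text = " ".join(str(v) for v in values)
    with open(path, "w") as f:
        f.write(text)
    return text

# Hypothetical file name and answer values, for illustration only.
out_path = os.path.join(tempfile.gettempdir(), "answer_1.txt")
written = write_answer(out_path, [3, 0.25, "yes"])

# Read the file back to confirm it holds exactly what was intended:
# repr() makes any stray newline or invisible character visible.
with open(out_path) as f:
    read_back = f.read()
print(repr(read_back))
```

The point of the design is that the file content is produced by code you can inspect, so an editor never gets the chance to append a newline or a byte-order mark behind your back.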



By the way, I was reminded of The Hitchhiker's Guide to the Galaxy because the courses periodically suggest setting the random seed to 42.
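For anyone wondering why a seed is fixed at all, here is a minimal sketch using only the standard library's random module. The courses work with NumPy and scikit-learn, where the analogous controls are numpy.random.seed and the random_state parameter; the snippet below is just an illustration of the principle, not code from the course.

```python
import random

SEED = 42  # the value the courses tend to suggest

random.seed(SEED)
first_run = [random.randint(0, 99) for _ in range(5)]

random.seed(SEED)  # re-seeding restarts the same pseudo-random sequence
second_run = [random.randint(0, 99) for _ in range(5)]

# Same seed, same numbers: this is what makes results reproducible.
print(first_run == second_run)
```

With a fixed seed, your answers can match the ones the grader (or a reviewer) gets when rerunning the same code.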



Well, I think it is time to sum up.



Part 4 "I'll be back" - conclusion.



Let's answer the questions in order:



1. Did the skills I gained from my earlier training help me (see the first three articles of the series)?



- Yes, but not much. On the one hand, it is good to have an idea of what awaits you (the Cognitive Class courses), and the textbook on Data Science from scratch also came in handy: I reread its section on probability theory (there is little material there, but what is written is clearer). The experience with Kaggle will also help when you do the Capstone project. In sum, however, the practical skills from all three previous articles do not come close to what the specialization gives you, so if you have firmly made up your mind, you can start “without preludes”.



2. Did I suffer from a lack of basic knowledge?



- Yes, badly in places, especially when I could not write the simplest function for 2.5 days, or simply could not grasp some points of statistics and probability theory. Fortunately, there are the forum and Slack, with many people in the same position, and you can find help there; the course mentors, and sometimes the developers themselves, also try to help. If things are really bad, you can hire a personal tutor, but I think anyone can cope on their own.



3. Did I learn something new?



- Yes. First, for the first time in my life I wrote a program that ran for 9.5 hours straight and then died with a memory error (I fixed it all afterwards, of course). My computer is weak, but even games with decent graphics could not compete with my creation at devouring resources. It was a very good experience: I will now forever remember the importance and benefits of sparse matrices. Second, there are other useful things. The courses do teach you a little Python; I still know it poorly and have not mastered the “Pythonic way”, but that is much better than nothing. The courses also explain the basic principles of higher mathematics and statistics quite well (without going into detail); in effect, I rediscovered them for myself. The specialization really does show many interesting techniques, some of which you can, if you wish, carry over into your daily life. Yes, there are problems with absorbing the information: I think 3/4 of the material went past me, but even the rest is enough to guess which way to dig if I ever need to analyze data.



4. Can anyone take this course?



- I think so, as long as the desire is there. Maybe not in a month, but anyone who knows what they really want will definitely manage. The people on the courses were varied: young guys and girls as well as older people, some with a good grasp of the material and some without.
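A side note on the memory error from answer 3: the sketch below shows, with made-up sizes, why a mostly empty matrix is worth storing sparsely. It uses a toy dictionary-of-keys structure from the standard library purely for illustration; in practice a library such as scipy.sparse would do this job, and the dimensions here are assumptions, not the ones from my program.

```python
# Toy "dictionary of keys" sparse matrix; sizes are illustrative assumptions.
n_rows, n_cols = 100_000, 50_000

# Only non-zero cells are stored, keyed by (row, column).
sparse = {(0, 7): 1.0, (42, 4_999): 3.0, (99_999, 0): 2.0}

def get(matrix, row, col):
    """Return the stored value, or 0.0 for any cell that was never set."""
    return matrix.get((row, col), 0.0)

BYTES_PER_FLOAT64 = 8
dense_bytes = n_rows * n_cols * BYTES_PER_FLOAT64  # every cell stored
dense_gib = dense_bytes / 2**30

print("A dense float64 matrix would need about %.1f GiB" % dense_gib)
print("The sparse version stores", len(sparse), "values")
```

A dense array of that shape would not fit into the memory of an ordinary laptop, while the sparse form only pays for the cells that actually hold data.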



As a bonus, the authors write on the specialization's page about possible help with finding a job after completing it. I have not tried it yet, but the opportunity itself is pleasing.



To summarize: I definitely recommend the specialization. Many things in it are still rough, but in terms of price and quality I think it is a more than acceptable option.



What next? Maybe I will apply the acquired skills to my hobby and then post the material on Habr; maybe I will see how things stand with machine learning on .NET and reach my goal there too. But all that will be much later.



So I wish everyone good luck in mastering this interesting area of knowledge!



And so that the article does not seem too serious, here is a “bonus”:



As a bonus



Another great advantage of this specialization is that I learned the word “correlation”, and now I will stick it everywhere, whether it fits or not.



Your messages and comments on the previous articles taught me that the earlier articles in the series read more easily and contain a bit of humor (well, I hope that is true), while this one, judging by my own feeling, is harder to read; and indeed I wrote it staring at the monitor with a serious face.



If so, perhaps there is some correlation between how easy it was to learn from the materials covered in each article of the series and the number of notional “jokes” in that article.



Let's see, is there any kind of CORRELATION ?!



Let's compute the ratio of the number of “jokes” in each article to the number of words in it, and compare it with the difficulty of learning (the days spent on training).



The articles are numbered in the order I listed them at the beginning; this one is the fourth. The bonus section was not included when counting jokes and words.



By jokes I mean any hint of humor at all (including the pictures at the beginning of each article); the quotations in the headings were not counted as jokes. So:



1. Article No. 1: words = 2575, jokes = 5, days of training = 2

2. Article No. 2: words = 2098, jokes = 3, days of training = 3

3. Article No. 3: words = 2667, jokes = 4, days of training = 2

4. Article No. 4: words = 3051, jokes = 2, days of training = 37



The code below is for Python 3. For Python 2, remove the parentheses after print and make sure the division uses floats; you can also drop list() around zip().



import pandas as pd

# Jokes per word for each of the four articles
humor_rate = [5 / 2575, 3 / 2098, 4 / 2667, 2 / 3051]
# Days spent on the training described in each article
days = [2, 3, 2, 37]

df = pd.DataFrame(list(zip(humor_rate, days)), columns=['Humor rate', 'Days of study'])
print('Table:')
print(df)
# Pearson correlation between the two columns
print('Correlation between humor rate and days of study =', df['Humor rate'].corr(df['Days of study']))


Output:

Table:
   Humor rate  Days of study
0    0.001942              2
1    0.001430              3
2    0.001500              2
3    0.000656             37

Correlation between humor rate and days of study = -0.912343823382





So what do we get? A pronounced negative CORRELATION (Pearson's correlation coefficient), which tells us that, as a rule, the fewer days spent on learning, the more humor in the article.
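For the curious, the coefficient can be cross-checked without pandas by computing Pearson's r straight from its definition. This is a minimal sketch using only the standard library; the function name pearson is my own, and the input numbers are the ones from the list of articles above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient computed straight from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

humor_rate = [5 / 2575, 3 / 2098, 4 / 2667, 2 / 3051]  # jokes per word
days = [2, 3, 2, 37]                                   # days of training

r = pearson(humor_rate, days)
print(round(r, 4))
```

The result agrees with the value reported above to the printed precision, which is reassuring given that the whole sample is four points, one of which (37 days) does almost all the work.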



Of course, this is a joke example of CORRELATION. There is certainly too little data, and I also had trouble pinning down an unambiguous number of jokes per article, but let's treat it as a small example of how the skills gained in the specialization can be applied in practice, including for computing a CORRELATION.



P.S. How many times have I mentioned this word in the bonus part of the article? That's right: eight, counting the print() output.



Source: https://habr.com/ru/post/335214/


