📜 ⬆️ ⬇️

How to love machine learning and stop suffering

Our future is becoming increasingly associated with the development of artificial intelligence. Someone thinks that this is the end of the era of mankind, and someone sits down, goes through the courses and saws the code to deal with machine learning. I belong to the second category. At one time, when I thought about mastering this science and began to take the first courses, I wanted to give up. The complexity of the materials and the suffering seemed to be no limit. Now, from the height of my experience, I understand that all this could have been avoided. Therefore, under the cut I want to share the basics of ML for beginners "without pain."




Any human feelings are alien to the numpy library


I took a Machine Learning course of my own free will. At the first lesson, we were told that the annual course of matan, as well as a basic understanding of Python, would be enough for a normal course mastering. Sounds wonderful! Until then, until it comes to the realization that the program at different universities - is different. Linear algebra, discrete mathematics, asymptotic analysis, and so on inadvertently fell into the phrase “annual course of matan”.
')
I was also mistaken about the basic understanding of Python. “This is one of the easiest languages!” - asserted everything around. You can read good code as fiction. And I believed my acquaintances, because, without writing anything in Python before, I myself often preferred to read the implementation of different algorithms on it. After all, he is so laconic and perfectly conveys the essence.

The only problem is that reading the code and writing yourself are different tasks. Immediately begin to write good code in Python - this is a big problem (sorry for the captaincy).

After long sufferings and attempts to master the first dynamic interpreted language in my life, I had a moment of happiness and pride. As you may have guessed, the moment was short. Almost immediately, it came to me that this was not enough. Having learned to write at least somehow on python, it is necessary to relearn in order to correctly use machine learning libraries. I heard that for many it does not cause any particular difficulties. Someone may not even notice this stage. But it was very hard for me at first to master the standard collections at a fast pace and love them with all my heart, and then find out that ML libraries have their own opinion on this matter. They are completely uninterested in how cute and useful lists and dictionaries are convenient and easy to use. Any human feelings are alien to the numpy library.

As you already understood, the course was very difficult for me. I could hardly pass the first part of the course, having received a rating of "satisfactory". The course consisted of 2 parts and was designed for a year, but I decided not to hurt myself even more. I have an extremely depressing opinion about machine learning as a whole. I quite seriously decided that it was just not for me.

However, like all other wounds, this one over time dragged on. Recently, I increasingly began to read various articles about how people conquer new heights with the help of different methods of machine learning. The wonderful riddles of galaxies or the vital issues of medicine - we have a chance to get closer to solving them just by teaching the computer to think in the right direction. This thought does not give me peace, and so I decided to try again, filling me with motivation at the same time.

Where to begin


If you are at the beginning of my journey, start simple. For the first attempts do not need deep knowledge in mathematics. When I got to Microsoft, it came as a surprise to me that today you can not even be able to write code to learn ML. Let's go through a common path, at the same time we will find a basic solution for a simple task.

Make yourself an account on Azure ML Studio. There is a free quota, without binding a bank card, for several attempts. All the algorithms and the necessary procedures are implemented for us, and what is even more cool is that everything will work quickly even on a weak laptop. All calculations occur in the cloud.

You can fill in your data, but for a start, the proposed samples are perfect for us. I chose datasets about flight delays. Having a trained model, you can tell your friends, for example, that their flight may be delayed ... (Although, if I want to stay alive, I will need to come up with another way to use it)

To view available datasets, click on Datasets → Samples :



My chosen dataset is called Flight Delays Data .

Create your own experiment. To do this, click Experiment → New (at the bottom of the page) → Blank Experiment . By the way, the experiments also have a Samples tab, there you can explore ready-made models. But now it is more interesting for us to do everything ourselves.





The Azure ML platform pleasantly surprised me with its flexibility and unobtrusiveness. In this article I wanted to show that machine learning is available to everyone. The whole further process of work will look like “we chose the necessary blocks, threw them onto the work surface, connected them logically, started them, rejoice”.

If you feel confident, you can create your own blocks, for this you need to write a program code in Python or R. If you already have an army of trained models behind your back, then you probably are used to Jupyter Notebook and you can work with Azure ML through him.

Even if you are acutely allergic to the web interface, but you want to taste the pluses of the cloud - the developers took into account even this situation and made it possible to connect to Azure via the console. Read more here .

Let's return to our model. All the necessary blocks we will take in the menu on the left. They are conveniently divided into groups. During the first attempts, I advise you to look for what you need, while studying the next sections. But if you know the approximate name of the desired block, you can use the search.



Predict flight delay


The classic scenario of the machine learning algorithm is as follows:
  1. We find good data and make them even better. We clean from garbage, add useful information.

    Let me remind you, we chose data about flights.



    Drag the block on the working surface.



    It is necessary to better study and prepare the data. To do this, right-click on the exit from the block and select Visualize:



    We see a beautiful table.



    We can click on any column and see statistics for it.



    Studying the data, I found the columns DepDelay and DepDel15. They contain gaps, and so I decided to remove these columns.

    I plan to predict the binary sign - is it true that the plane will be more than 15 minutes late. The column ArrDel15 is responsible for it. In addition to it, there is also an ArrDelay column, which stores the late arrival time in minutes. Unfortunately, we have to remove it too, otherwise the experiment will not be entirely honest)

    To remove columns, select the Select Columns in Dataset block, connect it to the previous block, and then click on the Launch column selector button in the menu on the right.



    In the window that appears, select the desired columns.






  2. We divide the data into 2 parts - train and test. Our task is to forget about the test part for a while.
    Learn more about what train / test set is here . We will help block Split Data.



    Be sure to complete the circled fields on the right. The first - in what proportion to break - usually put about 0.7-0.8. The second is whether our partition is random. The check mark is already there: make sure that you do not accidentally remove it. It will also be nice to ask Random seed, read about it here .

  3. We give the train part to some machine learning algorithm.
    The most difficult thing to do for us. The choice of the algorithm is a delicate moment. I took Random Forest according to the light memory (neatly - here it was called Decision Forest). We can use any two-class classification algorithm.



    You can choose something else, get a better result and tell about it in the comments)

    We also need the Train Model block. It will be necessary to connect the blocks as shown in the screenshot below:



    In the Train Model block, we will also need to click on the Launch Column Selector and select the column that we want to predict - in our case, ArrDel15.

  4. The resulting model is checked using the test part.

    The Model Model block will help us to cope with this. Do not forget to connect to it also the second part of the data after splitting.

    The last block for today - Evaluate Model - will present us the result in a convenient form. The final graph looks like this:



    It's time to proudly press the "Run" button and go drink tea. Even for the cloud, learning is not the fastest process.

    If the tea is already finished, but the process has not been completed, I advise you to study a couple of materials that will help us read data about the quality of the learning outcomes of our model.

    We could see this if we did not remove the linearly dependent ArrDelay column from the data. The model predicts perfectly, she was not mistaken once. I saw it, let a mean tear of joy go and went to do the experiment again, to be honest)




    But I got this result after deleting the ArrDelay column. Worse, but it looks like the truth.




  5. Are you satisfied with the quality? Congratulations! Now you can take new objects from the real world, and the computer will predict everything necessary for them. I got a prediction accuracy of 80%, and this is not magic, but a great start.

  6. If the quality does not suit you - we return to the beginning of the task and look for what can be improved.

Of course, I maximally simplified the process. The art of preparing data, dividing it into parts, choosing an algorithm and measuring quality is honed over the years. Nevertheless, the fact “to assemble a model from scratch in just 10 minutes” gives me a second wind and enlivens a huge interest in this topic. And what if I take not the Random Forest, but the SVM? By the way, do you know the difference between these algorithms? Both have a huge mat.base inside and a rather complicated implementation, but everyone can understand the general idea. There would be a desire ;-) By the way, you can start by studying this cheat sheet .

I hope that my article will help you avoid suffering and fall in love with ML, as well as me. Share your opinion and experience in the comments, it will be interesting to talk!

If you would be interested in an article for beginners on a more specific topic, let us know in the comments, and I will try to share my experience in more detail. You can also read the article by Evgeny Grigorenko , in it you will find more practical scenarios aimed at more experienced users.

Source: https://habr.com/ru/post/325728/


All Articles