"Now he counted you" or Data Science from Scratch

Not long ago I wrote about how I accidentally became acquainted with the concept of Data Science through the Cognitive Class courses. To sum that article up briefly: I didn't really learn anything from the courses, but my curiosity was piqued, so some time later I went to the store and bought the book this article is about.

I'm not sure how appropriate it is to describe learning from a printed textbook on Habr, but this hub is, after all, about the educational process in IT. So if you are wondering what this book can teach a complete beginner in Data Science, and whether it is worth spending time and money on at this stage, welcome under the cut.

Part 1. "I am one" - a little about skills

I must say that before reading this book, my idea of the benefits of Data Science was not far removed from the title picture, borrowed from a favorite cartoon.
So that readers can project my experience onto themselves, I have to say a little about my starting skills. As last time, the dossier has remained almost unchanged:

No known ties to mathematical analysis or statistics;
Python programming skills;
Aware that Data Science exists, has no practical skills;
Character: Nordic, steadfast. Not married.

So why did I decide to study this book and share my impressions of it?
Right after the Cognitive Class courses, I decided to look at Kaggle and realized that even in the tutorial on solving the Titanic problem I did not understand the essence of almost any of the techniques and definitions.

This book did not require any starting skills and promised a pleasant immersion into the world of data science. Am I now confident that, having read the book, I can solve the Titanic problem? The answer is at the end of the article.

Part 2. “Two is the Calf” - general information about the book

The book "Data Science. The science of data from scratch" seems to have appeared on the domestic market quite recently, judging at least by the fact that I could not find an electronic version to download or buy. The original was released in 2015. Of course, a lot changes in the IT world in 2 years; for example, new versions of Python data analysis libraries come out. Here we must pay tribute to the author (Joel Grus) and to the book's localizer. The book was originally written for Python 2, but the author did not abandon his brainchild: he adapted the source code for Python 3 (and, by the way, published it on GitHub). The translator, thank God, put the already adapted code into the book (apparently with minor adjustments).

Thanks also to the translators for the brief instructions on installing Anaconda and/or setting up the environment in case you do not want to install Anaconda.

And so we begin the story of the book itself. The back cover carries a quotation that describes its contents quite accurately: “Joel will give you a tour of data science. As a result, you will move from simple curiosity to a deep understanding of the vital algorithms that any data analyst should know.” - Rohit Sivaprasad. Well, at least the first part of this quotation is 100% correct: the book really does resemble a tour in which you have to see the Hermitage in 2 hours, and all you can do is run after the guide, catching a brief note about each masterpiece. Oddly enough, I can't say that this is bad; at least you manage to finish the book before it gets boring.

It should be noted that someone like me can be misled by the book's title.
Here “from scratch” does not mean going from zero knowledge to some practical level; it means that all the analysis and visualization functions in the examples are written from scratch as the material is presented. This is similar to the book “Linux from scratch”, which is not meant to get you using some Linux distribution “with interesting wallpapers” right away, but to have you systematically assemble your own system from scratch (even if you never use it).

This approach has its advantages and disadvantages. On the one hand, you are unlikely to keep using the functions you borrow from the book; on the other hand, you may come to understand the general principles (in many places I did not get there on the first try).

So, as law-enforcement officers write in their reports: “on the substance of the matter, I report the following:”

Part 3. “Three is Python” - content and general approach.

I must say that the book really does live up to the “excursion” format. It outlines perhaps almost all the basic concepts that can be found in other Data Science courses (for example, on Coursera). This has advantages and disadvantages: on the one hand, you can read the book in 2-3 days if you wish, and it won't have time to get boring; on the other hand, if you read “diagonally” you really can miss something, and then you have to go back and reread the chapter to understand it.

The author showed imagination and tied the material to your job at a fictional social network for data scientists, “DataSciencester”. I must say this is a pleasant approach: from a distance the tasks look like everyday ones, and the complexity of the “work” tasks you solve gradually increases from chapter to chapter.

In the first chapter the training starts right off the bat: the author shows how to solve several made-up tasks in Python, for example, building a chart of the number of friends in our fictional social network, or identifying and plotting the relationship between a data scientist's work experience and salary.
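
To give a feel for this kind of task, here is a minimal sketch of counting friends per user. The friendship pairs and the names `friendship_pairs` and `friend_counts` are made up for illustration and are not taken from the book.

```python
from collections import Counter

# Hypothetical friendship pairs (user_id, user_id); illustrative data only.
friendship_pairs = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]

def friend_counts(pairs):
    """Count how many friends each user has in an undirected friendship list."""
    counts = Counter()
    for a, b in pairs:
        counts[a] += 1  # b is a friend of a
        counts[b] += 1  # a is a friend of b
    return counts

counts = friend_counts(friendship_pairs)

# Users sorted from most to fewest friends (ties broken by user id).
ranking = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

The resulting `ranking` list could then be passed to a plotting library to draw the chart; plotting is left out here to keep the sketch dependency-free.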

Next comes an intensive crash course in Python. You can't call it redundant, but to give the author his due, the code in the rest of the book does not go much beyond what is covered in Chapter 2. So if you have ever dealt with basic data types and other concepts, the code in the book should not, in theory, cause problems (although it did for me).

After the introductory part and the basics of Python, the remaining directions of the book can be divided into 3 parts:

Very brief basics of mathematical analysis and statistics;
Collection, processing, storage of data;
Machine learning (mathematical models and algorithms for data processing and prediction);

A fragment of the book and its table of contents can be viewed on Ozon (not an advertisement); the table of contents and the first chapter are available there.

From the text, let's move on to the practical part. Above there was a link to the author's GitHub page, which hosts the code presented in the book and the necessary data.

The localized edition of the book includes a link to an archive with an adapted (Russified) version of the code; so as not to violate anyone's rights, I will refrain from posting it.
All the code is provided as Python 2 and Python 3 sources, as well as Jupyter notebooks. I must say many thanks to this book, because thanks to it I discovered Anaconda (a handy thing). In my opinion, it is most convenient to experiment with the book's code in the Jupyter notebooks (Jupyter is installed by default with Anaconda). On the other hand, in each notebook all the code is crammed into a single cell, with no breakdown and no separate text inserts, so this is a matter of taste rather than a clear advantage. By the way, if the root directory where Jupyter “sees” files does not suit you, there is some advice that really works (with options for both Windows and Linux).

Note that the notebooks come with pre-computed results, so you can view them without running the code. After re-running the calculations, though, some places may need a little “dancing with a tambourine”: installing libraries or other small things (for example, connecting to service APIs).

I do not want to make unsubstantiated claims, so I hope the author will not be offended if I show a piece of code from his book.

Here, for example, is a code snippet on linear algebra (so as not to violate the translator's rights, I took the original from GitHub). In the book this code is interleaved with the explanation of the material; in the notebook and in the source file it is in one continuous piece.

    # -*- coding: utf-8 -*-
    # linear_algebra.py

    import re, math, random  # regexes, math functions, random numbers
    import matplotlib.pyplot as plt  # pyplot
    from collections import defaultdict, Counter
    from functools import partial, reduce

    #
    # functions for working with vectors
    #

    def vector_add(v, w):
        """adds two vectors componentwise"""
        return [v_i + w_i for v_i, w_i in zip(v,w)]

    def vector_subtract(v, w):
        """subtracts two vectors componentwise"""
        return [v_i - w_i for v_i, w_i in zip(v,w)]

    def vector_sum(vectors):
        return reduce(vector_add, vectors)

    def scalar_multiply(c, v):
        return [c * v_i for v_i in v]

    def vector_mean(vectors):
        """compute the vector whose i-th element is the mean of the
        i-th elements of the input vectors"""
        n = len(vectors)
        return scalar_multiply(1/n, vector_sum(vectors))

    def dot(v, w):
        """v_1 * w_1 + ... + v_n * w_n"""
        return sum(v_i * w_i for v_i, w_i in zip(v, w))

    def sum_of_squares(v):
        """v_1 * v_1 + ... + v_n * v_n"""
        return dot(v, v)

    def magnitude(v):
        return math.sqrt(sum_of_squares(v))

    def squared_distance(v, w):
        return sum_of_squares(vector_subtract(v, w))

    def distance(v, w):
        return math.sqrt(squared_distance(v, w))

    #
    # functions for working with matrices
    #

    def shape(A):
        num_rows = len(A)
        num_cols = len(A[0]) if A else 0
        return num_rows, num_cols

    def get_row(A, i):
        return A[i]

    def get_column(A, j):
        return [A_i[j] for A_i in A]

    def make_matrix(num_rows, num_cols, entry_fn):
        """returns a num_rows x num_cols matrix
        whose (i,j)-th entry is entry_fn(i, j)"""
        return [[entry_fn(i, j) for j in range(num_cols)]
                for i in range(num_rows)]

    def is_diagonal(i, j):
        """1's on the 'diagonal', 0's everywhere else"""
        return 1 if i == j else 0

    identity_matrix = make_matrix(5, 5, is_diagonal)

    #          user 0  1  2  3  4  5  6  7  8  9
    #
    friendships = [[0, 1, 1, 0, 0, 0, 0, 0, 0, 0], # user 0
                   [1, 0, 1, 1, 0, 0, 0, 0, 0, 0], # user 1
                   [1, 1, 0, 1, 0, 0, 0, 0, 0, 0], # user 2
                   [0, 1, 1, 0, 1, 0, 0, 0, 0, 0], # user 3
                   [0, 0, 0, 1, 0, 1, 0, 0, 0, 0], # user 4
                   [0, 0, 0, 0, 1, 0, 1, 1, 0, 0], # user 5
                   [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 6
                   [0, 0, 0, 0, 0, 1, 0, 0, 1, 0], # user 7
                   [0, 0, 0, 0, 0, 0, 1, 1, 0, 1], # user 8
                   [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]] # user 9
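
As a rough illustration of how helpers like these might be put to work, the sketch below applies a dot product and a row lookup to a small adjacency matrix. The 4-user matrix and the functions `friends_of` and `common_friends` are my own hypothetical additions, not part of the book's code; `dot` and `distance` are redefined so the block runs on its own.

```python
import math

def dot(v, w):
    """v_1 * w_1 + ... + v_n * w_n"""
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def distance(v, w):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((v_i - w_i) ** 2 for v_i, w_i in zip(v, w)))

# A 4-user adjacency matrix in the same 0/1 style (illustrative data only).
small_friendships = [[0, 1, 1, 0],  # user 0
                     [1, 0, 1, 1],  # user 1
                     [1, 1, 0, 0],  # user 2
                     [0, 1, 0, 0]]  # user 3

def friends_of(matrix, user):
    """Ids of all users marked with a 1 in the given user's row."""
    return [other for other, is_friend in enumerate(matrix[user]) if is_friend]

def common_friends(matrix, a, b):
    """The dot product of two 0/1 rows counts positions where both are 1."""
    return dot(matrix[a], matrix[b])

user_1_friends = friends_of(small_friendships, 1)
shared_0_1 = common_friends(small_friendships, 0, 1)
```

Storing friendships as a matrix makes "are these two users friends?" a single index lookup, at the cost of keeping a lot of zeros around.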

The author, as promised, writes all the functions from scratch and tries to explain how they work, and then, at the end of each chapter, honestly tells you that there are libraries where all this is implemented better.
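
To illustrate that point without pulling in third-party packages, here is a small sketch comparing from-scratch versions of `magnitude` and `distance` with ready-made equivalents from Python's standard `math` module. It assumes Python 3.8 or newer, where `math.dist` and multi-argument `math.hypot` are available; in real work a library such as NumPy would more typically fill this role. The sample vectors are arbitrary.

```python
import math

def magnitude(v):
    """From-scratch vector length: sqrt(v_1^2 + ... + v_n^2)."""
    return math.sqrt(sum(v_i * v_i for v_i in v))

def distance(v, w):
    """From-scratch Euclidean distance between v and w."""
    return magnitude([v_i - w_i for v_i, w_i in zip(v, w)])

# Arbitrary sample vectors for the comparison.
v = [1.0, 2.0, 2.0]
w = [4.0, 6.0, 2.0]

# Ready-made equivalents from the standard library (Python 3.8+).
library_magnitude = math.hypot(*v)
library_distance = math.dist(v, w)

same_magnitude = math.isclose(magnitude(v), library_magnitude)
same_distance = math.isclose(distance(v, w), library_distance)
```

Both pairs agree, so the hand-written versions serve mainly for understanding, while the library calls are the ones to reach for in practice.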

Closer to the end, the explanation of how the functions are developed feels rushed to an unprepared reader (me, for example), and somewhere past the middle of the book not all of the code's logic is clear anymore (I think I will have to reread it). But along the way you and the author build with your own hands yet more ~~damn reinvented wheels,~~ useful basic solutions in areas such as heap, a primitive analogue of a DBMS, all the basic analysis functions and models, neural networks, decision trees, text generators, and “captcha” recognizers. Even a quick acquaintance with this whole set may well develop your interest in the subject.

Part 4. “Hooray goat!” - Conclusion.

So, what is the bottom line?

Since at the moment all my knowledge of Data Science is limited to this book and the Cognitive Class (CC) courses, I will start by comparing the two.

I don't know whether it is the native-language factor, or perhaps that the author, unlike the CC courses, at least labeled the examples on the charts properly, but in terms of general understanding the book gave me much more for the same time spent (about 2 full days in both cases), despite the lack of videos, labs, exams, and so on. Even the book's lack of “certificates” and “badges” gives CC no advantage at all (for they are worthless).

Can a complete novice come to understand something about the main approaches in data science? Rather yes than no. Will they immediately be able to do something worthwhile after finishing the book? Rather no than yes. Using the book's examples in regular day-to-day work would probably be bad practice, which means you need to learn the main data analysis libraries (the author says so himself in the course of the book). And I assume it will be useful to come back to the from-scratch examples someday, once you have got the hang of the ready-made libraries.

Is the book helpful to a newbie? I think so. If you imagine your brain debating with itself, you get something like an “Overton window”: at first the very idea that you need to delve into a concept such as variance, regression, or neural networks seems unacceptable, but each time you quietly come to the conclusion that it is not so scary.

So, as an excursion into the world of Data Science, the book is quite suitable. At the very least, interest in the subject only grows as you read, and I think it will be much easier to study the concepts first met in this book in more thorough training courses.

Whether a 300-page book with small print on newsprint is worth 550 rubles in the end is up to you. I can say one thing: this book gave me the confidence that I can now somehow solve the Titanic problem on Kaggle. I think that will be the subject of my next article.

Source: https://habr.com/ru/post/331794/
