
"Write letters ..." or train to work with data on appeals of citizens to the government of Moscow (DataScience)

Greetings, colleagues! It is time to continue our spontaneous mini-series of articles on the basics of machine learning and data analysis.

Last time we looked at applying linear regression to open data from the Moscow government; this time the data is also open, but it had to be collected by hand.

So, today we raise a burning topic: citizens' appeals to the executive authorities of Moscow. Ahead of you are a brief description of the dataset, a primitive data analysis, a linear regression model applied to it, and a short reference list of training courses for those who understood nothing in this article. And of course there will be room for independent creativity.
Let me remind you that our article is intended primarily for beginners in Python and its popular DataScience libraries. Ready? Then welcome under the cut.




UPD 07.01.17: two more months of data have been added to the project on GitHub, so the results have changed a bit. The article was written for version 1.0, which is available in the project history on GitHub.

To begin with, as usual, literally two words about the required level of preparation.
This material should be quite understandable to anyone who has read a popular machine learning and data analysis tutorial for Python or has finished a more or less sane on-line course. If you do not know where to start at all, you can jump straight to the end of the article or read the materials from the first part of the series:


Our article will consist of the following sections:


This time, perhaps, we can do without fancy headings.

Part I: getting acquainted with the data


To avoid wasting time, I will say right away that, as always, the dataset and a Jupyter Notebook (Python 3) with the example are posted on GitHub.

If you feel up to it, you can start experimenting right away. As for us, we will continue.

The data for the set is taken from the Official Portal of the Mayor and the Government of Moscow; the data is public and, as long as the source is referenced, freely available for processing. However, unlike the data available on the open data portal of the Moscow government, this information is presented in a format inconvenient for machine processing. (If someone writes a script for automated collection and shares it with us, honor and praise to them.)

The data had to be collected by hand, and how plausible it is we leave to the developers' conscience. I will note only two things: in one of the months the numbers on the diagram were not labeled (judging by eye), and in another month one of the columns of the diagram looks suspicious. But even if there are some shortcomings, they will not hurt us.

In its original form, the data contains the following columns:

  1. num - index
  2. year - year of the record
  3. month - month of the record
  4. total_appeals - total number of appeals for the month
  5. appeals_to_mayor - total number of appeals addressed to the Mayor
  6. res_positive - number of positive decisions
  7. res_explained - number of appeals that received an explanation
  8. res_negative - number of appeals with a negative decision
  9. El_form_to_mayor - number of appeals to the Mayor submitted in electronic form
  10. Pap_form_to_mayor - number of appeals to the Mayor submitted on paper
  11. to_10K_total_VAO ... to_10K_total_YUZAO - number of appeals per 10,000 residents in each district of Moscow
  12. to_10K_mayor_VAO ... to_10K_mayor_YUZAO - number of appeals to the Mayor and the Government of Moscow per 10,000 residents in each district of the city

The data collection period starts in 01.2016 and ends in 08.2017 (for the time being), which gives 32 columns and only 20 rows. The data is a .csv file with a tab as the separator.

Part II: the simplest analysis


In principle, the source data on the Mayor's portal already contains some analytics (for example, running totals), so we will not repeat what is already there. And in general, since the article is aimed at beginners, and I am one of them myself, for the sake of demonstration we will simplify every decision we come across.

Therefore, let me say right away: our goal is not to predict anything perfectly, but simply to practice. Experienced readers can share their wisdom in the comments, for which we will be grateful.

For the analysis, we will use Jupyter Notebook (Python 3), and as usual the entire code of the notebook is posted on GitHub.

Fasten your seat belts, we are off on an adventure through the waves of bureaucracy.

Load the necessary libraries:

#import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import requests, bs4
import time
from sklearn import model_selection
from collections import OrderedDict
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

Then we read the data and take a look at our table.

#load and view data
df = pd.read_csv('msc_appel_data.csv', sep='\t', index_col='num')
df.head(12)

These are the data obtained for the first year.



Unfortunately, the whole table did not fit on the screen; to give you an idea of it, after the ellipsis there are similar columns for the other districts of Moscow.

Let's now see if there is any linear correlation between the first seven columns of the table (excluding the index).

Of course, you could check more, but this is enough for us, and it just fits on the screen.

columns_to_show = ['res_positive', 'res_explained', 'res_negative', 'total_appeals',
                   'appeals_to_mayor', 'El_form_to_mayor', 'Pap_form_to_mayor']
data = df[columns_to_show]
grid = sns.pairplot(df[columns_to_show])
plt.savefig('1.png')  # save the figure next to the notebook

In the first line, we defined the set of columns that we will extract from our table (it will still be useful to us later).

Then, using the seaborn library, we built a pair plot (if your PC takes a while to think here, don't worry, that is normal).

The last line saves our image to a file in the same folder as the notebook; you can delete it, I only need it to make preparing the article easier.

As a result, we get just such beauty:



What do we see in the diagram? For example, that at the intersection of the number of "explained" appeals and the total number of appeals, the points line up almost in a straight line. Let's verify this by looking at the numerical value of the correlation coefficient.

 print("Correlation coefficient for a explained review result to the total number of appeals =", df.res_explained.corr(df.total_appeals) ) print("Corr.coeff. for a total number of appeals to mayor to the total number of appeals to mayor in electronic form =", df.appeals_to_mayor.corr(df.El_form_to_mayor) ) 

Correlation coefficient for a explained review result to the total number of appeals = 0.830749053715
Corr.coeff. for a total number of appeals to mayor to the total number of appeals to mayor in electronic form = 0.685450192201

In the first line, using the corr method built into the Pandas DataFrame, we calculated the correlation between these two columns. The correlation turned out to be quite high, but that is to be expected: in most cases it is, as the saying goes, an irresistible force meeting an immovable object. On the one hand, appeals are not always well-formed; on the other hand, any bureaucracy will in most cases try to get away with a formal reply. So it is not surprising that "explained" appeals are the most numerous, and consequently that the more appeals there are, the more explanatory answers they receive. That is what the correlation coefficient of 0.83 tells us.

In the second line of the code, we looked at the second pair: the total number of appeals to the Mayor and the number of appeals sent to him in electronic form. There is also a dependency here, and if it were not for the obvious outliers (which may just as well be errors in the data as atypical behavior of citizens), the correlation coefficient could well be even closer to one.
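By the way, if you want to see all the pairwise relations at once rather than checking pairs one by one, you can compute a correlation matrix and draw it as a seaborn heatmap. A minimal sketch on top of the df and columns_to_show we already have (the figure size and the '1b.png' file name are just my own picks):

# pairwise correlation matrix for the same columns
corr_matrix = df[columns_to_show].corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Pairwise correlation of the main columns')
plt.savefig('1b.png')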

The original data contains charts of the activity of the population in the districts of Moscow, reflecting the number of citizens' appeals per 10 thousand residents. It looks like this:



We will compute the same thing, but for the entire period.

To begin with, we select the columns we need; we leave the manipulations with appeals addressed to the Mayor himself for you to study on your own and concentrate on the total number of appeals.

district_columns = ['to_10K_total_VAO', 'to_10K_total_ZAO', 'to_10K_total_ZelAO',
                    'to_10K_total_SAO', 'to_10K_total_SVAO', 'to_10K_total_SZAO',
                    'to_10K_total_TiNAO', 'to_10K_total_CAO', 'to_10K_total_YUAO',
                    'to_10K_total_YUVAO', 'to_10K_total_YUZAO']
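By the way, instead of typing out all eleven names by hand, the same list can be assembled from the column names themselves; a small sketch, assuming the to_10K_total_ prefix is used consistently in the file:

# build the same list automatically from the column prefix
district_columns_auto = [c for c in df.columns if c.startswith('to_10K_total_')]
assert sorted(district_columns_auto) == sorted(district_columns)  # should match the manual list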

Then we shorten the names of our fields to the names of the districts and build a bar chart.

y_pos = np.arange(len(district_columns))

short_district_columns = district_columns.copy()
for i in range(len(short_district_columns)):
    short_district_columns[i] = short_district_columns[i].replace('to_10K_total_', '')

distr_sum = df[district_columns].sum()

plt.figure(figsize=(16, 9))
plt.bar(y_pos, distr_sum, align='center', alpha=0.5)
plt.xticks(y_pos, short_district_columns)
plt.ylabel('Number of appeals')
plt.title('Number of appeals per 10,000 people for all time')
plt.savefig('2.png')



What does the chart show us? That the most complaint-heavy district is the Central one. And most likely not because the people there are "mad with fat", as the saying goes. I have one little thought about this...
Let's go to the website of the "My Street" program and, purely out of curiosity, look at where the work was carried out in 2017.



The vast majority of the work was carried out in the Central Administrative District. I do not think this is the only reason for such a large number of appeals per capita, but it may well be one of them.

Part III: adding data from the web


Well, here we have arrived at the idea that our dataset by itself is "boring" and does not explain much. Let's expand it with data from the global network.
We will consider only two simple cases, and after that you have complete freedom of creativity.

To begin with, we will convert the data from the previous chart from appeals per 10,000 people into the total number of appeals. To perform this conversion, we need to know the number of people living in each district. Wikipedia will help us here; we copy the data from it by hand.

# we will collect the data manually from
# https://ru.wikipedia.org/wiki/%D0%90%D0%B4%D0%BC%D0%B8%D0%BD%D0%B8%D1%81%D1%82%D1%80%D0%B0%D1%82%D0%B8%D0%B2%D0%BD%D0%BE-%D1%82%D0%B5%D1%80%D1%80%D0%B8%D1%82%D0%BE%D1%80%D0%B8%D0%B0%D0%BB%D1%8C%D0%BD%D0%BE%D0%B5_%D0%B4%D0%B5%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5_%D0%9C%D0%BE%D1%81%D0%BA%D0%B2%D1%8B
# the data is filled in the same order as district_columns
district_population = [1507198, 1368731, 239861, 1160576, 1415283, 990696,
                       339231, 769630, 1776789, 1385385, 1427284]

# transition from appeals per 10,000 people to appeals for the entire population of the district
total_appel_dep = district_population * distr_sum / 10000

plt.figure(figsize=(16, 9))
plt.bar(y_pos, total_appel_dep, align='center', alpha=0.5)
plt.xticks(y_pos, short_district_columns)
plt.ylabel('Number of appeals')
plt.title('Number of appeals per total population of district for all time')
plt.savefig('3.png')

The figures in district_population go in the same order as our columns in district_columns (see above). We get the following chart.



Where can this be applied? It is hard to say, there is clearly not enough data, but you could dig in the direction of a more even distribution of the workload on the specialists of the district councils responsible for citizens' appeals.
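One more remark on the conversion above: relying on the order of two parallel lists is a little fragile, so if you prefer, the same calculation can be keyed by district name. A small sketch with the same Wikipedia figures:

# the same conversion, but keyed by district name instead of list order
population_by_district = dict(zip(short_district_columns, district_population))
total_appeals_by_district = pd.Series(
    {d: distr_sum['to_10K_total_' + d] * population_by_district[d] / 10000
     for d in short_district_columns})
print(total_appeals_by_district.sort_values(ascending=False))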

Now let's complicate the task a bit and try to pick up some data automatically. Fortunately, Python has decent libraries that can help us with this. We will take one of the most popular ones, BeautifulSoup.

#we use beautifulsoup
oil_page = requests.get('https://worldtable.info/yekonomika/cena-na-neft-marki-brent-tablica-s-1986-po-20.html')
b = bs4.BeautifulSoup(oil_page.text, "html.parser")
table = b.select('.item-description')
table = b.find('div', {'class': 'item-description'})
table_tr = table.find_all('tr')

d_parse = OrderedDict()
for tr in table_tr[1:len(table_tr) - 1]:
    td = tr.find_all('td')
    d_parse[td[0].get_text()] = float(td[1].get_text())

Let's see what all this means. If you go to the source data page, you will see that I chose a very simple case as the data source for the example. The page has a table with three columns.
The third column does not interest us, but you can process it yourself if you wish.

Looking at the page source, we see that the table has the .item-description class. We grab the whole table, then simply loop through each of its rows, stuffing the data into a dictionary where the key is the date and the value is the oil price. We chose an OrderedDict to make sure the data in the dictionary stays in exactly the order in which we read it.
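Before relying on the parsed dictionary, it does not hurt to glance at what the grabber actually returned; a tiny check along these lines:

# quick look at what the grabber returned
print('rows parsed:', len(d_parse))
for date, price in list(d_parse.items())[:3]:
    print(date, '->', price)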

# dictionary selection boundaries
d_start = 358
d_end = 378

# Uncomment all if grabber doesn't work
#d_parse=[(" 2016", 30.8), (" 2016", 33.2), (" 2016", 39.25), (" 2016", 42.78), (" 2016", 47.09),
#         (" 2016", 49.78), (" 2016", 46.63), (" 2016", 46.37), (" 2016", 47.68), (" 2016", 51.1),
#         (" 2016", 47.97), (" 2016", 54.44), (" 2017", 55.98), (" 2017", 55.95), (" 2017", 53.38),
#         (" 2017", 53.54), (" 2017", 50.66), (" 2017", 47.91), (" 2017", 49.51), (" 2017", 51.82)]
#d_parse=dict(d_parse)
#d_start=0
#d_end=20

We need this piece of code primarily to make sure the example keeps working if the source data page changes drastically (or disappears altogether). In that case it will be enough to uncomment the lines, after which the code below will let us continue with the example as if nothing had happened.

# values from January 2016 to August 2017
df['oil_price'] = list(d_parse.values())[d_start:d_end]
df.tail(5)

It remains only to create a column in the table for our oil price and to make sure everything turned out as planned.



As you may have guessed, we did not add the price of oil just for fun; let's see whether there is any correlation between the new column and a couple of the old ones.

 print("Correlation coefficient for the total number of appeals result to the oil price (in US $) =", df.total_appeals.corr(df.oil_price) ) print("Correlation coefficient for a positive review result to the oil price (in US $) =", df.res_positive.corr(df.oil_price) ) 

Correlation coefficient for the total number of appeals result to the oil price (in US $) = 0.446035680201
Correlation coefficient for a positive review result to the oil price (in US $) = -0.0530061539779


To be honest, I cannot explain how the price of oil is related to the total number of appeals; I think the presence of at least some correlation is largely a coincidence.

The lack of correlation between the number of positive decisions and the price of oil, however, can be explained. The poor rank-and-file specialists in the councils, the GUPs (to which part of the appeals inevitably trickles down), the departments of the Moscow government and other concerned units probably think first of all about how to stretch from one paycheck to the next and hardly feel the monthly fluctuations in oil prices. That means rising oil prices do not make them any kinder, and accordingly a positive oil price trend does not make anyone bend over backwards to get a bench in your yard repainted.

I think you can come up with a bunch of "crazy" combinations for analysis - go for it! In the meantime, we proceed to the linear regression promised earlier.

Part IV: linear regression, a first attempt


Once again, the web in general, and Habr in particular, is full of decent materials on linear regression, so we will not overcomplicate things and will only go through a couple of simple demonstration examples.

We will start with data preparation, and the first thing to do is to re-encode our categorical variable "month" into numerical counterparts. We get 12 new features; the code and the table will look like this.

df2 = df.copy()

#Let's make a separate column for each value of our categorical variable
df2 = pd.get_dummies(df2, prefix=['month'])



So, for example, a 1 in the month_May column means that the month in that row was May.
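If the idea of dummy (one-hot) encoding is new to you, here is a tiny standalone illustration of what pd.get_dummies does, unrelated to our dataset:

# toy example: a single categorical column turns into one indicator column per category
toy = pd.DataFrame({'month': ['January', 'May', 'January']})
print(pd.get_dummies(toy, prefix=['month']))
# each row gets a 1 (True) in the column of its own month and 0 (False) elsewhere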

Let's move on.

#Let's code the month with numbers
d = {'January':1, 'February':2, 'March':3, 'April':4, 'May':5, 'June':6,
     'July':7, 'August':8, 'September':9, 'October':10, 'November':11, 'December':12}
month = df.month.map(d)

#We paste the information about the date from several columns
dt = list()
for year, mont in zip(df2.year.values, month.values):
    s = str(year) + ' ' + str(mont) + ' 1'
    dt.append(s)

#convert the received data into the DateTime type and replace the year column with it
df2.rename(columns={'year': 'DateTime'}, inplace=True)
df2['DateTime'] = pd.to_datetime(dt, format='%Y %m %d')
df2.head(5)

In the first part of the code, we create a mapping from month names to their ordinal numbers; since pd.get_dummies deleted our original "month" column, we borrow it from the old table (df).

In the second part of the code, we glue together the string data from the year column, the month variable, and, just in case, the first day of each month (although I think we could have done without it).

In the third part, we rename the "year" column and replace its contents with datetime values built from our list according to the pattern '%Y %m %d' (year, month number, day of the month).
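For reference, pandas can also assemble dates directly from year/month/day components, so the string gluing above could be replaced with something like this sketch (it reuses the same month mapping d):

# alternative: build the dates without gluing strings
date_parts = pd.DataFrame({'year': df.year, 'month': df.month.map(d), 'day': 1})
dates_alt = pd.to_datetime(date_parts)
print(dates_alt.head())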

After these transformations, df2 looks like this:



We do not need the true month right now, but it will come in handy later.
For now, let's continue preparing the data for our linear regression model.

#Prepare the data
cols_for_regression = columns_to_show + district_columns
cols_for_regression.remove('res_positive')
cols_for_regression.remove('total_appeals')

X = df2[cols_for_regression].values
y = df2['res_positive']

#Scale the data
scaler = StandardScaler()
X_scal = scaler.fit_transform(X)
y_scal = scaler.fit_transform(y)

At the beginning, we decided which columns to take from the table as features (X) and which as the target (y). We also threw out the total_appeals column, not because we disliked it, but because it is redundant: it is a combination of three other columns.

After deciding on the data, we scale it just in case. In principle, this could be skipped; in this case it would not be fatal, and ridge regression would help even things out to some extent anyway, but with scaling it is simply more convenient and better behaved.
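A side note: the scaler and the model can be chained into a pipeline, so that the scaling parameters are computed only on the training part and no information from the test part leaks in. A minimal sketch, not used further in the article:

from sklearn.pipeline import make_pipeline

# scaler + ridge chained together: scaling statistics come only from the training part
ridge_pipe = make_pipeline(StandardScaler(), linear_model.Ridge(alpha=55.0))
Xp_train, Xp_test, yp_train, yp_test = train_test_split(X, y, test_size=0.2, random_state=42)
ridge_pipe.fit(Xp_train, yp_train)
print('pipeline test score:', ridge_pipe.score(Xp_test, yp_test))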

First, let's look at how almost all the columns, except the months and the oil price, help us determine the number of approved appeals.

X_train, X_test, y_train, y_test = train_test_split(X_scal, y_scal, test_size=0.2, random_state=42)
#y_train=np.reshape(y_train,[y_train.shape[0],1])
#y_test=np.reshape(y_test,[y_test.shape[0],1])

loo = model_selection.LeaveOneOut()

#alpha coefficient is taken at a rough guess
lr = linear_model.Ridge(alpha=55.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='mean_squared_error', cv=loo)
print('CV Score:', scores.mean())

lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test, y_test))

CV Score: -0.862647707895
Coefficients: [ 0.10473057 0.08121859 0.00540471 0.06896755 -0.04812318 0.04166228
0.0572629 -0.01035959 0.09634643 0.07031471 -0.02657464 0.02800165
0.03528063 0.02458972 0.06148957 0.04026195]
Test Score: -0.814435440002


Note that this time we not only divided our sample into a training and a control part with the train_test_split function, but also evaluated the quality of the model on the training sample with cross-validation. Since there is very little data, we used the following scheme: each sample in turn is taken as the control one, the model is trained on the rest, and the prediction results are averaged (leave-one-out). This evaluation is not a panacea, but given that the control sample consists of only four points, it is a more reliable quality estimate.

For the sake of curiosity, you can play around with the regularization coefficient alpha and watch how the score on the control sample and in cross-validation jumps around, or you can simply fiddle with the random_state parameter in train_test_split and see how the quality of the predictions jumps.
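If guessing alpha by eye feels unsatisfying, you can simply loop over a small grid of values and compare the cross-validation error; a sketch using the same leave-one-out scheme and the same (older-style) scoring name as in the notebook, with an arbitrary grid:

# try several regularization strengths and look at the cross-validation error
for alpha in [1.0, 7.0, 25.0, 55.0, 100.0]:
    model = linear_model.Ridge(alpha=alpha)
    cv_scores = model_selection.cross_val_score(
        model, X_train, y_train, scoring='mean_squared_error', cv=loo)
    print('alpha = {0:6.1f}  CV MSE = {1:.4f}'.format(alpha, -cv_scores.mean()))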

Since we are talking about the regularization coefficient: unlike the previous article, where we used L1 (lasso) regularization, L2 (ridge) regularization cannot completely remove barely useful features, but it greatly reduces their weight through the coefficients. We have little data, and it seems that in this case the model works better with a large regularization coefficient, overfitting less as a result.
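To see the difference between L1 and L2 in practice, you can fit a Lasso with some arbitrary alpha on the same training data and look at how many coefficients it zeroes out; a small illustrative sketch:

# L1 (Lasso) tends to zero out weak features, L2 (Ridge) only shrinks them
lasso = linear_model.Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
print('Lasso coefficients:', lasso.coef_)
print('number of zeroed features:', (lasso.coef_ == 0).sum())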

What do these numbers mean? Is it good or bad? We will figure that out a little later; for now let's do the same thing, but add the oil price as an input feature.

X_oil = df2[cols_for_regression + ['oil_price']].values
y_oil = df2['res_positive']

scaler = StandardScaler()
X_scal_oil = scaler.fit_transform(X_oil)
y_scal_oil = scaler.fit_transform(y_oil)

X_train, X_test, y_train, y_test = train_test_split(X_scal_oil, y_scal_oil, test_size=0.2, random_state=42)
#y_train=np.reshape(y_train,[y_train.shape[0],1])
#y_test=np.reshape(y_test,[y_test.shape[0],1])

loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=55.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='mean_squared_error', cv=loo)
print('CV Score:', scores.mean())

lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test, y_test))

CV Score: -0.863699353968
Coefficients: [ 0.10502651 0.0819168 0.00415511 0.06749919 -0.04864709 0.04241101
0.05686368 -0.00928224 0.09569626 0.0708282 -0.02600053 0.02783746
0.0360866 0.02536353 0.06146142 0.04065484 -0.02887498]
Test Score: -0.506208294281

As we can see, things improved on the control sample, but not in cross-validation. In this case, we will trust the cross-validation estimate.

So how should we interpret these numbers? Let me remind you that our mean squared error is computed on the scaled data; if we had not scaled it, its value would be larger. The best way to understand how our model predicts the control data is to plot it.

# plot for test data
plt.figure(figsize=(16, 9))
plt.scatter(lr.predict(X_test), y_test, color='black')
plt.plot(y_test, y_test, '-', color='green', linewidth=1)

plt.xlabel('relative number of positive results (predict)')
plt.ylabel('relative number of positive results (test)')
plt.title('Regression on test data')

print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))
plt.savefig('4.png')

predict: [-1.22036553 0.39006382 0.46499326 -0.27854243]
real: [-0.5543026 0.23746693 0.41263435 0.44332061]




Well, now we can clearly see how the predicted values differ from the ones they should be. For those who have not guessed, I will explain that in the ideal case all the points (predicted values) would lie on the green line (this is easy to check in the scatter plot by replacing lr.predict(X_test) with y_test).
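If the error on scaled data is hard to relate to real numbers of appeals, the predictions can be brought back to the original units by keeping a dedicated scaler for the target and inverting it; a minimal sketch (the y_scaler object is my own addition, it is not in the notebook):

# convert scaled predictions back to the original number of positive decisions
y_scaler = StandardScaler().fit(y_oil.values.reshape(-1, 1))
pred_original = y_scaler.inverse_transform(lr.predict(X_test).reshape(-1, 1)).ravel()
print('predictions in original units:', pred_original.round().astype(int))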

As you can see, our model is not perfect, but what else would you expect: we picked the regularization coefficient by eye, did not select any features properly, and there is precious little data to begin with.
But we will not be sad; instead we will gather our strength and make the final push.

Part V: linear regression, forecasting the trend


Up to this point, we took random points as the control sample, whereas it would be more logical to check whether our model can predict the number of positive decisions in the future based on the data available so far. It was not for nothing that we made a column with the date in the table.

Let me remind you that we have not used the month data in our analysis so far. We have not used it, and we will not use it just yet! First, let's see how our previous model with oil prices behaves under these new conditions.

X_train = X_scal_oil[0:16]
X_test = X_scal_oil[16:20]
y_train = y_scal_oil[0:16]
y_test = y_scal_oil[16:20]

loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=7.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='mean_squared_error', cv=loo)
print('CV Score:', scores.mean())

lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test, y_test))

# plot for test data
plt.figure(figsize=(19, 10))

#train line
plt.scatter(df2.DateTime.values[0:16], lr.predict(X_train), color='black')
plt.plot(df2.DateTime.values[0:16], y_train, '--', color='green', linewidth=3)

#test line
plt.scatter(df2.DateTime.values[16:20], lr.predict(X_test), color='black')
plt.plot(df2.DateTime.values[16:20], y_test, '--', color='blue', linewidth=3)

#connecting line
plt.plot([df2.DateTime.values[15], df2.DateTime.values[16]], [y_train[15], y_test[0]],
         color='magenta', linewidth=2, label='train to test')

plt.xlabel('Date')
plt.ylabel('Relative number of positive results')
plt.title('Time series')

print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))
plt.savefig('5.1.png')

CV Score: -0.989644199134
Coefficients: [ 0.29502827 0.18625818 -0.05782895 0.14304852 -0.19414197 0.00671457
0.00761346 -0.09589469 0.23355104 0.1795458 -0.08298576 -0.09204623
0.00742492 -0.03964034 0.13593245 -0.00747192 -0.18743228]
Test Score: -4.31744509658
predict: [ 1.40179872 0.5677182 0.1258284 0.38227278]
real: [ 0.53985448 0.23746693 0.35765479 0.84671711]



On data split in this different way, the model performs a little worse; the regularization coefficient may also play a role here, so for clarity I took it the same as in the following example (equal to 7).
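By the way, scikit-learn has a ready-made splitter for exactly this "train on the past, test on the following months" setting; a small sketch with TimeSeriesSplit on the same oil-price features (the number of splits is an arbitrary choice):

from sklearn.model_selection import TimeSeriesSplit

# each fold trains on an initial segment and validates on the months right after it
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_scal_oil)):
    model = linear_model.Ridge(alpha=7.0)
    model.fit(X_scal_oil[train_idx], y_scal_oil[train_idx])
    print('fold', fold, 'test months', test_idx,
          'score', model.score(X_scal_oil[test_idx], y_scal_oil[test_idx]))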

Well, all that is left is to finally drop the oil and bring in our encoded month features.

cols_months = ['month_December', 'month_February', 'month_January', 'month_July', 'month_June',
               'month_March', 'month_May', 'month_November', 'month_October', 'month_September',
               'month_April', 'month_August']

X_month = df2[cols_for_regression + cols_months].values
y_month = df2['res_positive']

scaler = StandardScaler()
X_scal_month = scaler.fit_transform(X_month)
y_scal_month = scaler.fit_transform(y_month)

X_train = X_scal_month[0:16]
X_test = X_scal_month[16:20]
y_train = y_scal_month[0:16]
y_test = y_scal_month[16:20]

loo = model_selection.LeaveOneOut()
lr = linear_model.Ridge(alpha=7.0)
scores = model_selection.cross_val_score(lr, X_train, y_train, scoring='mean_squared_error', cv=loo)
print('CV Score:', scores.mean())

lr.fit(X_train, y_train)
print('Coefficients:', lr.coef_)
print('Test Score:', lr.score(X_test, y_test))

# plot for test data
plt.figure(figsize=(19, 10))

#train line
plt.scatter(df2.DateTime.values[0:16], lr.predict(X_train), color='black')
plt.plot(df2.DateTime.values[0:16], y_train, '--', color='green', linewidth=3)

#test line
plt.scatter(df2.DateTime.values[16:20], lr.predict(X_test), color='black')
plt.plot(df2.DateTime.values[16:20], y_test, '--', color='blue', linewidth=3)

#connecting line
plt.plot([df2.DateTime.values[15], df2.DateTime.values[16]], [y_train[15], y_test[0]],
         color='magenta', linewidth=2, label='train to test')

plt.xlabel('Date')
plt.ylabel('Relative number of positive results')
plt.title('Time series')

print('predict: {0} '.format(lr.predict(X_test)))
print('real: {0} '.format(y_test))
plt.savefig('5.2.png')

CV Score: -0.909527242059
Coefficients: [ 0.09886191 0.11920832 0.02519177 0.20624114 -0.13140361 -0.02511699
0.0580594 -0.12742719 0.13987627 0.07905998 -0.08918158 0.00626676
-0.00090422 -0.01557178 0.0838269 0.00827684 0.04305265 -0.05808898
0.01884837 -0.06313912 0.04531003 0.1165687 -0.13590156 -0.29777529
0.03542855 0.12639045 -0.00721213 0.15110762]
Test Score: -0.35070187517
predict: [ 0.71512724 0.37641552 -0.10881606 0.71539711]
real: [ 0.53985448 0.23746693 0.35765479 0.84671711]




It is clear to the naked eye that, taking the month factor into account, the model describes our data better. Although I think that, in terms of prediction quality, there is still plenty of room to grow.

If there are knowledgeable people around, I would be glad if someone could work out a forecast on this dataset using the Statsmodels library. I understand it poorly, but it seemed to me that for a proper analysis with it the data would be "not enough!"
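For those who want to give it a try, a bare-bones starting point with statsmodels might look like the sketch below: a plain OLS on the same scaled features, just to get the summary table. I am not claiming this is a proper time-series treatment, and with 16 training rows and this many features the summary will mostly confirm how scarce the data is (statsmodels has to be installed separately).

import statsmodels.api as sm

# ordinary least squares on the first 16 months, just to look at the summary
X_sm = sm.add_constant(X_scal_month[0:16])
ols_model = sm.OLS(y_scal_month[0:16], X_sm).fit()
print(ols_model.summary())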



I could not resist adding a picture :)

Part VI: if absolutely nothing is clear


This section is dedicated to beginners; everyone else, I think, can safely skim through it.


Even though we were only sorting out the simplest things here, someone may nevertheless have encountered "DataScience" for the first time. So if you liked this "magic" but understood almost nothing about how to start "conjuring" over data yourself, I can only send you... to the first articles of the cycle :)


If you want a more systematic start, it is worth going through an on-line course rather than scattered articles.

The best-known Russian-language option is the machine learning and data analysis specialization on Coursera; with an audience of over 150K it hardly needs an introduction, although for someone completely new to DataScience it can feel heavy.

A gentler entry point is offered by IBM on the Cognitive Class platform (the former Big Data University), where the courses are built around Python.

In particular, there is the Applied Data Science with Python learning path, and within it the Data Analysis with Python course, which is designed for about 5 weeks of study and uses the Anaconda distribution.

Part VII: instead of a conclusion


Where to go from here? Kaggle is the obvious next step: datasets, competitions and other people's solutions to learn from.

And, of course, nothing stops you from continuing to experiment with this very dataset: add new features from other open sources and see whether the predictions improve; as promised, there is plenty of room for independent creativity.

Good luck!

And see you again in DataScience :)

Source: https://habr.com/ru/post/343216/

