
Predicting the time to solve a ticket using machine learning

When you create a ticket in a project management system, it is always nice to see an estimate of when your request will be resolved.
And a person or team receiving a stream of incoming tickets needs to line them up by priority and by how long each one will take.
Good estimates let both sides plan their time more efficiently.


Under the cut, I will talk about how I analyzed our ticket history and trained ML models that predict the resolution time of tickets filed to our team.


I work as an SRE in a team called LAB. We are flooded with requests from both developers and QA: deploying new test environments, updating them to the latest release versions, solving various problems that come up, and much more. These tasks are quite heterogeneous and, logically, take different amounts of time to complete. The team has existed for several years, and over that time a good database of tickets has accumulated. I decided to analyze this database and, based on it, use machine learning to build a model that predicts the likely time to close a ticket.


In our work we use JIRA, but the model presented in this article is not tied to any specific product: the necessary information is easy to pull out of any tracker's database.


So let's move from words to deeds.


Preliminary data analysis


Let's import everything we need and print the versions of the packages used.


Source
import warnings
warnings.simplefilter('ignore')

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sklearn
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

# The bare top-level modules are imported as well so their versions can be printed.
for package in [pd, np, matplotlib, sklearn, nltk]:
    print(package.__name__, 'version:', package.__version__)

pandas version: 0.23.4
numpy version: 1.15.0
matplotlib version: 2.2.2
sklearn version: 0.19.2
nltk version: 3.3

Load the data from a csv file. It contains information about tickets closed over the last 1.5 years. Before being written to the file, the data was slightly pre-processed: for example, commas and periods were removed from the text description fields. This is only preliminary processing, though; the text will be cleaned further later on.
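
The loading step itself is trivial; a minimal sketch, assuming the export lives in a hypothetical tickets.csv with the ticket key in the first column:

# A sketch of the loading step; the file name is an assumption.
# Dates are kept as strings here, exactly as they come out of the export.
df = pd.read_csv('tickets.csv', index_col=0)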


Let's see what is in our data set. A total of 10,783 tickets were included.


Explanation of the fields:

- Created: date and time the ticket was created.
- Resolved: date and time the ticket was closed.
- Resolution_time: the number of minutes elapsed between creation and closing (a derivation sketch follows this list). Calendar time is used, because the company has offices in different countries working in different time zones, and there are no fixed working hours for the whole department.
- Engineer_N: "encoded" names of engineers (to avoid inadvertently giving out personal or confidential information, the article contains quite a lot of such "encoded" data, which is in fact simply renamed). Each of these fields holds the number of tickets that were "in progress" for that engineer at the moment the given ticket arrived. I will dwell on these fields separately towards the end of the article, since they deserve extra attention.
- Assignee: the employee who worked on the ticket.
- Issue_type: the type of the ticket.
- Environment: the name of the test environment the ticket was filed for (it can mean either a specific environment or a location as a whole, for example a data center).
- Priority: the ticket priority.
- Worktype: the type of work expected for the ticket (adding or removing servers, updating an environment, working with monitoring, etc.).
- Description: the ticket description.
- Summary: the ticket title.
- Watchers: the number of people "watching" the ticket, i.e. receiving an e-mail notification for every activity in it.
- Votes: the number of people who "voted" for the ticket, signalling its importance and their interest in it.
- Reporter: the person who filed the ticket.
- Engineer_N_vacation: whether the engineer was on vacation at the moment the ticket was filed.
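
If the export did not already contain Resolution_time, it could be derived from the two timestamps; a sketch:

# Sketch: calendar minutes between creation and closing.
created = pd.to_datetime(df['Created'])
resolved = pd.to_datetime(df['Resolved'])
df['Resolution_time'] = ((resolved - created).dt.total_seconds() // 60).astype('int64')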

 df.info() 

<class 'pandas.core.frame.DataFrame'>
Index: 10783 entries, ENV-36273 to ENV-49164
Data columns (total 37 columns):
Created                 10783 non-null object
Resolved                10783 non-null object
Resolution_time         10783 non-null int64
engineer_1              10783 non-null int64
engineer_2              10783 non-null int64
engineer_3              10783 non-null int64
engineer_4              10783 non-null int64
engineer_5              10783 non-null int64
engineer_6              10783 non-null int64
engineer_7              10783 non-null int64
engineer_8              10783 non-null int64
engineer_9              10783 non-null int64
engineer_10             10783 non-null int64
engineer_11             10783 non-null int64
engineer_12             10783 non-null int64
Assignee                10783 non-null object
Issue_type              10783 non-null object
Environment             10771 non-null object
Priority                10783 non-null object
Worktype                 7273 non-null object
Description             10263 non-null object
Summary                 10783 non-null object
Watchers                10783 non-null int64
Votes                   10783 non-null int64
Reporter                10783 non-null object
engineer_1_vacation     10783 non-null int64
engineer_2_vacation     10783 non-null int64
engineer_3_vacation     10783 non-null int64
engineer_4_vacation     10783 non-null int64
engineer_5_vacation     10783 non-null int64
engineer_6_vacation     10783 non-null int64
engineer_7_vacation     10783 non-null int64
engineer_8_vacation     10783 non-null int64
engineer_9_vacation     10783 non-null int64
engineer_10_vacation    10783 non-null int64
engineer_11_vacation    10783 non-null int64
engineer_12_vacation    10783 non-null int64
dtypes: float64(12), int64(15), object(10)
memory usage: 3.1+ MB

In total, we have 10 "object" fields (i.e. containing a text value) and 27 numeric fields.
First of all, let's look for outliers in our data. As you can see, there are tickets whose resolution time runs into millions of minutes. This is clearly not relevant information; such data would only get in the way of building the model. It got here because the data was pulled from JIRA by a query on the Resolved field rather than on Created: everything closed in the last 1.5 years ended up in the set, including tickets opened much earlier. It is time to get rid of them. We discard tickets created before June 1, 2017, which leaves us with 9,493 tickets.


As for the reasons, I think any project has requests like these: hanging around for ages due to various circumstances and more often closed not because the problem itself was solved, but because the "statute of limitations" expired.


Source
 df[['Created', 'Resolved', 'Resolution_time']].sort_values('Resolution_time', ascending=False).head() 


Source
 df = df[df['Created'] >= '2017-06-01 00:00:00'] print(df.shape) 

 (9493, 33) 

Now let's see what interesting things we can find in our data. To begin with, the simplest: the most popular environments among our tickets, the most active reporters, and so on.


Source
 df.describe(include=['object']) 


Source
 df['Environment'].value_counts().head(10) 

Environment_104    442
ALL                368
Location02         367
Environment_99     342
Location03         342
Environment_31     322
Environment_14     254
Environment_1      232
Environment_87     227
Location01         202
Name: Environment, dtype: int64

Source
 df['Reporter'].value_counts().head() 

Reporter_16     388
Reporter_97     199
Reporter_04     147
Reporter_110    145
Reporter_133    138
Name: Reporter, dtype: int64

Source
 df['Worktype'].value_counts() 

Support               2482
Infrastructure        1655
Update environment    1138
Monitoring             388
QA                     300
Numbers                110
Create environment      95
Tools                   62
Delete environment      24
Name: Worktype, dtype: int64

Source
df['Priority'].value_counts().plot(kind='bar', figsize=(12,7), rot=0, fontsize=14, title='Ticket priorities');


Well, we have already learned something. Most tickets have normal priority; high is about half as common, and critical rarer still. Low priority is very rare: apparently people are afraid to set it, believing the ticket would then sit in the queue for quite a long time and its resolution would be delayed. Later, when we build the model and analyze its results, we will see that such concerns are not unfounded, since low priority does affect a ticket's timing, and certainly not in the direction of speeding it up.


From the columns with the most popular environments and the most active reporters, we see Reporter_16 leading by a wide margin and Environment_104 in first place among environments. Even if you haven't guessed it yet, I'll let you in on a little secret: this reporter is from the team working on that very environment.
Now let's see which environment critical tickets come from most often.


Source
 df[df['Priority'] == 'Critical']['Environment'].value_counts().index[0] 

 'Environment_91' 

And now let's see how many tickets with other priorities come from this same "critical" environment.


Source
env = df[df['Priority'] == 'Critical']['Environment'].value_counts().index[0]
df[df['Environment'] == env]['Priority'].value_counts()

High        62
Critical    57
Normal      46
Name: Priority, dtype: int64

Let's look at ticket resolution time in the context of priorities. For example, it is amusing to note that the average resolution time of a low-priority ticket is more than 70 thousand minutes (almost 1.5 months). The dependence of a ticket's resolution time on its priority is also easy to see.


Source
 df.groupby(['Priority'])['Resolution_time'].describe() 


Or here is the median value as a graph. As you can see, the picture has not changed much, so the outliers do not affect the distribution all that strongly.


Source
 df.groupby(['Priority'])['Resolution_time'].median().sort_values().plot(kind='bar', figsize=(12,7), rot=0, fontsize=14); 


Now let's look at the average resolution time for each engineer, depending on how many tickets that engineer had in progress at the time. To my surprise, these charts show no single pattern. For some engineers the resolution time grows as the number of tickets in progress grows, for some the dependence is reversed, and for others no dependence can be traced at all.


However, looking ahead again, I will say that the presence of this feature in the dataset improved the model's accuracy by more than a factor of 2, so the effect on resolution time is definitely there. We just don't see it. But the model does.


Source
engineers = [i.replace('_vacation', '') for i in df.columns if 'vacation' in i]
cols = 2
rows = int(len(engineers) / cols)
fig, axes = plt.subplots(nrows=rows, ncols=cols, figsize=(16,24))

for i in range(rows):
    for j in range(cols):
        df.groupby(engineers[i * cols + j])['Resolution_time'].mean().plot(
            kind='bar', rot=0, ax=axes[i, j]
        ).set_xlabel('Engineer_' + str(i * cols + j + 1))

del cols, rows, fig, axes

(Long image with the resulting per-engineer charts.)

Let's build a small scatter matrix of pairwise interactions between the following features: ticket resolution time, number of votes and number of watchers. As a bonus, the diagonal shows the distribution of each feature.


Among the interesting bits: resolution time visibly decreases as the number of watchers grows. It is also clear that people do not use votes very actively.


Source
 pd.scatter_matrix(df[['Resolution_time', 'Watchers', 'Votes']], figsize=(15, 15), diagonal='hist'); 

(Image: the resulting scatter matrix.)

So, we have done a small preliminary analysis of the data and seen the dependencies between the target variable, ticket resolution time, and such features as the number of votes for the ticket, the number of its watchers, and its priority. Moving on.


Building the model. Constructing the features


It is time to move on to building the model itself. But first we need to put our features into a form the model understands, i.e. expand categorical features into sparse vectors and get rid of the excess. For example, the fields with the creation and closing times of a ticket will not be needed in the model, and neither will the Assignee field, since we will ultimately use this model to predict the resolution time of a ticket that has not yet been assigned to anyone.


The target variable, as I just mentioned, is the time to solve the problem, so we take it out as a separate vector and remove it from the main data set. In addition, some of our fields turned out to be empty, because reporters do not always fill in the description when filing a ticket. In that case pandas sets the value to NaN, and we simply replace it with an empty string.


Source
y = df['Resolution_time']

df.drop(['Created', 'Resolved', 'Resolution_time', 'Assignee'], axis=1, inplace=True)
df['Description'].fillna('', inplace=True)
df['Summary'].fillna('', inplace=True)

We expand the categorical features into sparse vectors (one-hot encoding). For now we do not touch the fields with the ticket description and title; we will use them a bit differently. Some reporter names contain [X]: this is how JIRA marks inactive employees who no longer work for the company. I decided to keep them among the features, although the data could be cleaned of them, since when using the model in the future we will never meet tickets from these employees.


Source
def create_df(dic, feature_list):
    out = pd.DataFrame(dic)
    out = pd.concat([out, pd.get_dummies(out[feature_list])], axis=1)
    out.drop(feature_list, axis=1, inplace=True)
    return out

X = create_df(df, df.columns[df.dtypes == 'object'].drop(['Description', 'Summary']))
X.columns = X.columns.str.replace(' \[X\]', '')

And now let's deal with the description field of the ticket. We will handle it in one of the simplest ways: collect all the words used in our tickets, count the most popular among them, and discard the "extra" words, those that obviously cannot influence the result, such as "please", which happens to be the most popular one (all communication in JIRA is strictly in English). Yes, our people are that polite.


We also remove stop words, as defined by the nltk library, and clean the text of unnecessary characters more thoroughly. Let me note that this is about the simplest thing one can do with text: we do not stem the words, and we could also count the most popular N-grams, but we will limit ourselves to just this.


Source
all_words = np.concatenate(df['Description'].apply(lambda s: s.split()).values)

stop_words = stopwords.words('english')
stop_words.extend(['please', 'hi', '-', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '(', ')', '=', '{', '}'])
stop_words.extend(['h3', '+', '-', '@', '!', '#', '$', '%', '^', '&', '*', '(for', 'output)'])
stop_symbols = ['=>', '|', '[', ']', '#', '*', '\\', '/', '->', '>', '<', '&']

words_series = pd.Series(list(all_words))
del all_words
words_series = words_series[~words_series.isin(stop_words)]

for symbol in stop_symbols:
    words_series = words_series[~words_series.str.contains(symbol, regex=False, na=False)]

After all this we get a pandas.Series object containing all the words used. Let's look at the most popular ones and take the first 50 from the list to use as features. For each ticket we will check whether the word appears in its description and, if so, put 1 in the corresponding column, otherwise 0.


Source
 usefull_words = list(words_series.value_counts().head(50).index) print(usefull_words[0:10]) 

 ['error', 'account', 'info', 'call', '{code}', 'behavior', 'array', 'update', 'env', 'actual'] 

Now we create separate columns in our common data set for the words we have chosen. After that we can get rid of the description field itself.


Source
for word in usefull_words:
    X['Description_' + word] = X['Description'].str.contains(word).astype('int64')

X.drop('Description', axis=1, inplace=True)
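
As a side note, sklearn's CountVectorizer could produce much the same binary indicators in a couple of lines; a sketch, though we stick with the manual approach above to keep full control over the stop lists:

from sklearn.feature_extraction.text import CountVectorizer

# Alternative sketch: binary presence of the 50 most frequent words,
# with English stop words removed by the vectorizer itself.
vectorizer = CountVectorizer(binary=True, max_features=50, stop_words='english')
description_features = vectorizer.fit_transform(df['Description'])
print(vectorizer.get_feature_names()[:10])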

Let's do the same for the ticket header field.


Source
all_words = np.concatenate(df['Summary'].apply(lambda s: s.split()).values)

words_series = pd.Series(list(all_words))
del all_words
words_series = words_series[~words_series.isin(stop_words)]

for symbol in stop_symbols:
    words_series = words_series[~words_series.str.contains(symbol, regex=False, na=False)]

usefull_words = list(words_series.value_counts().head(50).index)

for word in usefull_words:
    X['Summary_' + word] = X['Summary'].str.contains(word).astype('int64')

X.drop('Summary', axis=1, inplace=True)

Let's see what we ended up with in the feature matrix X and the target vector y.


Source
 print(X.shape, y.shape) 

 ((9493, 1114), (9493,)) 

Now we split the data into a training sample and a hold-out (test) sample, 75/25. In total we have 7,119 examples to train on and 2,374 on which to evaluate our models. The dimensionality of our feature matrix has grown to 1,114 because of the one-hot encoding of the categorical features.


Source
 X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, test_size=0.25, random_state=17) print(X_train.shape, X_holdout.shape) 

 ((7119, 1114), (2374, 1114)) 

We train the model.


Linear regression


Let's start with the simplest and (expectedly) least accurate model: linear regression. We will evaluate the accuracy both on the training data and on the hold-out sample, data the model has not seen.


Linear regression performs more or less acceptably on the training data, but the accuracy on the hold-out sample is monstrously low, even worse than simply predicting the mean over all tickets.
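
To put the numbers below in context, it is worth computing that trivial baseline, the one that always predicts the training mean; a quick sketch:

# Trivial baseline: always predict the mean resolution time of the training set.
baseline = np.full(len(y_holdout), y_train.mean())
print('Baseline MAE:', mean_absolute_error(baseline, y_holdout))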


Here we need to pause briefly and explain how the model assesses quality with its score method.
The evaluation uses the coefficient of determination:


$$R^2 = 1 - \frac{\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{m}(y_i - \overline{y})^2}$$


where $\hat{y}_i$ is the value predicted by the model and $\overline{y}$ is the mean over the entire sample.


We will not dwell on this coefficient for now. Note only that it does not fully reflect the accuracy we actually care about. Therefore, in parallel we will use the Mean Absolute Error (MAE) and rely on it.
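
To make the difference tangible, here is a toy example computing both metrics by hand on made-up numbers; note how a single wild miss drags R^2 far below zero while MAE stays interpretable:

# Toy example: four true values and one badly mispredicted one.
y_true = np.array([100., 200., 300., 400.])
y_pred = np.array([110., 190., 310., 4000.])

ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
print('R^2:', 1 - ss_res / ss_tot)             # about -258: one outlier ruins it
print('MAE:', np.abs(y_true - y_pred).mean())  # 907.5: still readable in minutes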


Source
lr = LinearRegression()
lr.fit(X_train, y_train)

print('R^2 train:', lr.score(X_train, y_train))
print('R^2 test:', lr.score(X_holdout, y_holdout))
print('MAE train', mean_absolute_error(lr.predict(X_train), y_train))
print('MAE test', mean_absolute_error(lr.predict(X_holdout), y_holdout))

R^2 train: 0.3884389470220214
R^2 test: -6.652435243123196e+17
MAE train: 8503.67256637168
MAE test: 1710257520060.8154

Gradient boosting


Where would we be without gradient boosting? Let's try to train a model and see what we get. We will use the well-known XGBoost for this, starting with the default hyperparameters.


Source
import xgboost

xgb = xgboost.XGBRegressor()
xgb.fit(X_train, y_train)

print('R^2 train:', xgb.score(X_train, y_train))
print('R^2 test:', xgb.score(X_holdout, y_holdout))
print('MAE train', mean_absolute_error(xgb.predict(X_train), y_train))
print('MAE test', mean_absolute_error(xgb.predict(X_holdout), y_holdout))

R^2 train: 0.5138516547636054
R^2 test: 0.12965507684512545
MAE train: 7108.165167471887
MAE test: 8343.433260957032

The out-of-the-box result is already not bad. Let's try to tune the model by selecting the hyperparameters n_estimators, learning_rate and max_depth. We end up with the values 150, 0.1 and 3 respectively, as the ones showing the best result on the test sample without overfitting the model on the training data.


We select n_estimators

* The y-axis in the picture should read MAE instead of R^2 Score.


xgb_model_abs_testing = list()
xgb_model_abs_training = list()
rng = np.arange(1, 151)

for i in rng:
    xgb = xgboost.XGBRegressor(n_estimators=i)
    xgb.fit(X_train, y_train)
    xgb_model_abs_testing.append(mean_absolute_error(xgb.predict(X_holdout), y_holdout))
    xgb_model_abs_training.append(mean_absolute_error(xgb.predict(X_train), y_train))

plt.figure(figsize=(14, 8))
plt.plot(rng, xgb_model_abs_testing, label='MAE test')
plt.plot(rng, xgb_model_abs_training, label='MAE train')
plt.xlabel('Number of estimators')
plt.ylabel('MAE')  # the plotted values are MAE, not R^2
plt.legend(loc='best')
plt.show()


We select learning_rate
xgb_model_abs_testing = list()
xgb_model_abs_training = list()
rng = np.arange(0.05, 0.65, 0.05)

for i in rng:
    xgb = xgboost.XGBRegressor(n_estimators=150, random_state=17, learning_rate=i)
    xgb.fit(X_train, y_train)
    xgb_model_abs_testing.append(mean_absolute_error(xgb.predict(X_holdout), y_holdout))
    xgb_model_abs_training.append(mean_absolute_error(xgb.predict(X_train), y_train))

plt.figure(figsize=(14, 8))
plt.plot(rng, xgb_model_abs_testing, label='MAE test')
plt.plot(rng, xgb_model_abs_training, label='MAE train')
plt.xlabel('Learning rate')
plt.ylabel('MAE')
plt.legend(loc='best')
plt.show()


We select max_depth
xgb_model_abs_testing = list()
xgb_model_abs_training = list()
rng = np.arange(1, 11)

for i in rng:
    xgb = xgboost.XGBRegressor(n_estimators=150, random_state=17, learning_rate=0.1, max_depth=i)
    xgb.fit(X_train, y_train)
    xgb_model_abs_testing.append(mean_absolute_error(xgb.predict(X_holdout), y_holdout))
    xgb_model_abs_training.append(mean_absolute_error(xgb.predict(X_train), y_train))

plt.figure(figsize=(14, 8))
plt.plot(rng, xgb_model_abs_testing, label='MAE test')
plt.plot(rng, xgb_model_abs_training, label='MAE train')
plt.xlabel('Maximum depth')
plt.ylabel('MAE')
plt.legend(loc='best')
plt.show()
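
For the record, the same search could be automated with sklearn's GridSearchCV; note that it scores by cross-validation on the training set rather than on our fixed hold-out, so the results may differ slightly. A sketch (the grid values here are illustrative):

from sklearn.model_selection import GridSearchCV

# Cross-validated grid search over the same three hyperparameters.
param_grid = {'n_estimators': [50, 100, 150],
              'learning_rate': [0.05, 0.1, 0.2],
              'max_depth': [2, 3, 5]}
search = GridSearchCV(xgboost.XGBRegressor(random_state=17), param_grid,
                      scoring='neg_mean_absolute_error', cv=3)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)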


Now we train the model with the selected hyperparameters.


Source
xgb = xgboost.XGBRegressor(n_estimators=150, random_state=17, learning_rate=0.1, max_depth=3)
xgb.fit(X_train, y_train)

print('R^2 train:', xgb.score(X_train, y_train))
print('R^2 test:', xgb.score(X_holdout, y_holdout))
print('MAE train', mean_absolute_error(xgb.predict(X_train), y_train))
print('MAE test', mean_absolute_error(xgb.predict(X_holdout), y_holdout))

R^2 train: 0.6745967150462303
R^2 test: 0.15415143189670344
MAE train: 6328.384400466232
MAE test: 8217.07897417256

Here is the final result with the selected parameters, plus a visualization of feature importance, i.e. which features matter in the model's opinion. In first place is the number of watchers of the ticket, followed immediately by 4 engineers. So a ticket's resolution time can be noticeably affected by how busy a particular engineer is, and it is quite logical that the free time of some of them matters more, if only because the team has both senior and middle engineers (we have no juniors on the team). By the way, another little secret: the engineer in first place (the orange bar) is indeed one of the most experienced on the whole team. Moreover, all 4 of these engineers carry the senior prefix in their job title. It turns out the model confirms this once again.


Source
features_df = pd.DataFrame(data=xgb.feature_importances_.reshape(1, -1),
                           columns=X.columns).sort_values(axis=1, by=[0], ascending=False)
features_df.loc[0][0:10].plot(kind='bar', figsize=(16, 8), rot=75, fontsize=14);
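
By the way, xgboost also ships its own plotting helper for the same picture; a one-line sketch (it ranks features by split count by default, so the ordering may differ slightly from feature_importances_):

# Built-in alternative to the manual bar chart above.
xgboost.plot_importance(xgb, max_num_features=10);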


Neural network


But we will not stop at gradient boosting alone; let's also try to train a neural network, namely a multilayer perceptron, a fully connected feed-forward network. This time we do not start with the default hyperparameters, because in the sklearn implementation we will use, the default is a single hidden layer with 100 neurons, and during training the model warns that it fails to converge within the standard 200 iterations. We will use 3 hidden layers with 300, 200 and 100 neurons respectively.


As a result, we see that the model overfits the training set quite heavily, which nevertheless does not stop it from showing a decent result on the test set. That result is only slightly worse than gradient boosting's.


Source
from sklearn.neural_network import MLPRegressor

nn = MLPRegressor(random_state=17, hidden_layer_sizes=(300, 200, 100), alpha=0.03,
                  learning_rate='adaptive', learning_rate_init=0.0005, max_iter=200,
                  momentum=0.9, nesterovs_momentum=True)
nn.fit(X_train, y_train)

print('R^2 train:', nn.score(X_train, y_train))
print('R^2 test:', nn.score(X_holdout, y_holdout))
print('MAE train', mean_absolute_error(nn.predict(X_train), y_train))
print('MAE test', mean_absolute_error(nn.predict(X_holdout), y_holdout))

R^2 train: 0.9771443840549647
R^2 test: -0.15166596239118246
MAE train: 1627.3212161350423
MAE test: 8816.204561947616

Let's see what we can achieve by searching for a better architecture for our network. To begin with, we train several models with one hidden layer and one with two, just to convince ourselves once more that single-layer models do not manage to converge within 200 iterations and, as the graph shows, could keep converging for a very, very long time.


plt.figure(figsize=(14, 8))

for i in [(500,), (750,), (1000,), (500, 500)]:
    nn = MLPRegressor(random_state=17, hidden_layer_sizes=i, alpha=0.03,
                      learning_rate='adaptive', learning_rate_init=0.0005, max_iter=200,
                      momentum=0.9, nesterovs_momentum=True)
    nn.fit(X_train, y_train)
    plt.plot(nn.loss_curve_, label=str(i))

plt.xlabel('Iterations')
plt.ylabel('MSE')
plt.legend(loc='best')
plt.show()


Next, let's train several deeper models, with 3 to 10 hidden layers.


plt.figure(figsize=(14, 8))

for i in [(500, 300, 100),
          (80, 60, 60, 60, 40, 40, 40, 40, 20, 10),
          (80, 60, 60, 40, 40, 40, 20, 10),
          (150, 100, 80, 60, 40, 40, 20, 10),
          (200, 100, 100, 100, 80, 80, 80, 40, 20),
          (80, 40, 20, 20, 10, 5),
          (300, 250, 200, 100, 80, 80, 80, 40, 20)]:
    nn = MLPRegressor(random_state=17, hidden_layer_sizes=i, alpha=0.03,
                      learning_rate='adaptive', learning_rate_init=0.001, max_iter=200,
                      momentum=0.9, nesterovs_momentum=True)
    nn.fit(X_train, y_train)
    plt.plot(nn.loss_curve_, label=str(i))

plt.xlabel('Iterations')
plt.ylabel('MSE')
plt.legend(loc='best')
plt.show()


"" (200, 100, 100, 100, 80, 80, 80, 40, 20) :
2506
7351


As you can see, the model overfits heavily again. Let's increase the regularization coefficient alpha and raise the learning rate.


nn = MLPRegressor(random_state=17, hidden_layer_sizes=(200, 100, 100, 100, 80, 80, 80, 40, 20),
                  alpha=0.1, learning_rate='adaptive', learning_rate_init=0.007, max_iter=200,
                  momentum=0.9, nesterovs_momentum=True)
nn.fit(X_train, y_train)

print('R^2 train:', nn.score(X_train, y_train))
print('R^2 test:', nn.score(X_holdout, y_holdout))
print('MAE train', mean_absolute_error(nn.predict(X_train), y_train))
print('MAE test', mean_absolute_error(nn.predict(X_holdout), y_holdout))

R^2 train: 0.836204705204337
R^2 test: 0.15858607391959356
MAE train: 4075.8553476632796
MAE test: 7530.502826043687

The model now overfits noticeably less and shows a quite decent result on the test set, better than that of the gradient boosting.


A separate caveat concerns the data itself: some tickets are long-lived by their very nature. For example, tickets with issue type: Epic act as umbrella tasks and can stay open for a long time regardless of anyone's workload, and it is hard to blame the model for mispredicting them.


Our network has 9 hidden layers, so there is no ready-made feature importance for it. As a crude estimate, for each neuron of the first hidden layer we take the feature with the largest absolute weight and count how often each feature comes out on top.


pd.Series([X_train.columns[abs(nn.coefs_[0][:, i]).argmax()]
           for i in range(nn.hidden_layer_sizes[0])]).value_counts().head(5).sort_values().plot(
    kind='barh', title='Feature importance', fontsize=14, figsize=(14, 8));



Finally, let's combine the two models into a simple ensemble by averaging their predictions. What for? The network's MAE is 7530 and XGBoost's is 8217. You might expect the average error to be (7530 + 8217) / 2 = 7873, i.e. worse than the network alone, right? No, not like this. The models err on different tickets, so their errors partially cancel out, and the ensemble's MAE comes out at 7526.


This trick is constantly used on kaggle: even a simple average of two sufficiently different models often beats each of them individually.


nn_predict = nn.predict(X_holdout)
xgb_predict = xgb.predict(X_holdout)

print('NN MSE:', mean_squared_error(nn_predict, y_holdout))
print('XGB MSE:', mean_squared_error(xgb_predict, y_holdout))
print('Ensemble:', mean_squared_error((nn_predict + xgb_predict) / 2, y_holdout))
print('NN MAE:', mean_absolute_error(nn_predict, y_holdout))
print('XGB MAE:', mean_absolute_error(xgb_predict, y_holdout))
print('Ensemble:', mean_absolute_error((nn_predict + xgb_predict) / 2, y_holdout))

NN MSE: 628107316.262393
XGB MSE: 631417733.4224195
Ensemble: 593516226.8298339
NN MAE: 7530.502826043687
XGB MAE: 8217.07897417256
Ensemble: 7526.763569558157
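
The 50/50 blend is arbitrary; a quick sweep over the mixing weight checks whether a different proportion would do better. A sketch:

# Try blend weights from 0 (pure XGB) to 1 (pure NN) in steps of 0.1.
for w in np.arange(0, 1.01, 0.1):
    blend = w * nn_predict + (1 - w) * xgb_predict
    print(round(w, 1), mean_absolute_error(blend, y_holdout))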


So what do we have in the end? An ensemble MAE of about 7500 minutes, i.e. roughly 5 calendar days (7526 / 60 / 24 ≈ 5.2). Is that a lot? For requests that take months it is a perfectly usable estimate; for tickets solved within an hour it is not. The fairest way to judge is to look at the extremes of the error distribution.


The ten largest errors, in minutes:


 ((nn_predict + xgb_predict) / 2 - y_holdout).apply(np.abs).sort_values(ascending=False).head(10).values 

[469132.30504392, 454064.03521379, 252946.87342439, 251786.22682697, 224012.59016987,
 203548.46460229, 172427.32150665, 171088.75543224, 15671.21520735, 13201.12440327]

More than 450 thousand minutes is almost a year. Let's look at these tickets in more detail.


df.loc[((nn_predict + xgb_predict) / 2 - y_holdout).apply(np.abs)
       .sort_values(ascending=False).head(10).index][['Issue_type', 'Priority', 'Worktype', 'Summary', 'Watchers']]


As you can see, these are mostly tickets whose resolution dragged on for external reasons, hardly predictable from what is known at creation time.


And now the most accurate predictions.


print(((nn_predict + xgb_predict) / 2 - y_holdout).apply(np.abs).sort_values().head(10).values)

df.loc[((nn_predict + xgb_predict) / 2 - y_holdout).apply(np.abs)
       .sort_values().head(10).index][['Issue_type', 'Priority', 'Worktype', 'Summary', 'Watchers']]

[ 1.24606014,  2.6723969,   4.51969139, 10.04159236, 11.14335444,
 14.4951508,  16.51012874, 17.78445744, 21.56106258, 24.78219295]


Here the error is measured in mere minutes, and of course there is some luck in such near-perfect hits. Still, on a large share of tickets the predictions turn out to be quite usable.


The Engineer_N fields


Remember I promised to dwell separately on the Engineer_N fields, the ones holding the number of tickets each engineer had in progress? It is time to explain why they are there and how they were collected.


The idea behind them is simple: a ticket's resolution time should depend on how loaded the team is at the moment it arrives. But a freshly created ticket is not assigned to anyone yet, so we cannot use the future assignee's workload directly. Instead, for every engineer we record how many tickets were "in progress" on them when the ticket was created, and let the model work out whose availability actually matters.


Collecting them was the hard part. JIRA does not store such history in any convenient form, so for each ticket in the data set and each of the 12 engineers I had to run a JQL query of the following form (JQL is JIRA's query language):


 assignee was engineer_N during (ticket_creation_date) and status was "In Progress" 

That is 10783 * 12 = 129396 queries against JIRA… which, as you can guess, is a lot. Running them all took around 5 days, so this is not something you want to redo often.
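
For reference, here is a sketch of how such counts could be collected with the jira Python package; the server URL, credentials and helper function are placeholders, not the script I actually used:

from jira import JIRA  # pip install jira

jira = JIRA(server='https://jira.example.com', basic_auth=('user', 'password'))

def in_progress_count(engineer, created_date):
    # How many tickets were assigned to the engineer and "In Progress"
    # at the moment the new ticket was created.
    jql = ('assignee was {0} during ("{1}", "{1}") '
           'and status was "In Progress"'.format(engineer, created_date))
    return jira.search_issues(jql, maxResults=1).total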
Was it worth it? Definitely: as I mentioned earlier, adding these fields improved the model's accuracy by more than a factor of 2, and without them the predictions would be much less useful.



That is basically it. The model can already be applied in practice: for example, to show the reporter an expected resolution time when a ticket is filed, or to check our SLO targets against what the data actually says.
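
Serving a prediction for a fresh, still unassigned ticket then comes down to building a feature row the same way and averaging the two models; a minimal sketch, assuming new_ticket is a one-row DataFrame with exactly the columns of X:

def predict_resolution_minutes(new_ticket):
    # new_ticket: a one-row DataFrame with exactly the columns of X.
    nn_minutes = nn.predict(new_ticket)[0]
    xgb_minutes = xgb.predict(new_ticket)[0]
    return (nn_minutes + xgb_minutes) / 2

# Example: estimate the resolution time of the first hold-out ticket.
print(predict_resolution_minutes(X_holdout.iloc[[0]]))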


Of course, there is still plenty of room for improvement (for example: richer text features, more careful handling of outliers, regular retraining on fresh data), but even in this form the model does its job.



Source: https://habr.com/ru/post/433166/

