The de facto standard for building scoring models in the financial industry is logistic regression (logit models). The essence of the method is to find a linear combination of the input data (predictors) that, after the logit transformation, yields the most likely prediction.
A practical disadvantage of the method is the lengthy data preparation required to build a model (roughly a week of a specialist's work). In the real conditions of a microfinance company, the set of data on borrowers changes constantly, data providers are connected and disconnected, and loan generations change, so the preparation stage becomes a bottleneck.
Another disadvantage of logit models stems from their linearity: the effect of each individual predictor on the final result is uniform over the entire range of that predictor's values.
Models based on neural networks are free of these shortcomings, but they are rarely used in the industry: there are no reliable methods for assessing overfitting, and "noisy" values in the source data have a strong influence.
Below we show how, by using various methods of optimizing a neural-network model, one can obtain a better prediction than with models based on logit functions.
1. Statement of the problem of simplifying the structure of a mathematical model and its solution by nonsmooth regularization methods (using a linear model as an example)
1.1. Statement of the model-building problem
Most applied studies aim to establish a natural relationship between a measurable quantity and several factors:
$$ E(y\,|\,x) = f(x) = f(x, w), \quad w \in W \subset R^m, \; x \in \Omega \subset R^n, \quad (1) $$
where $E(y\,|\,x)$ is the mean value of the observed quantity $y$ for given values of the variables $x$; $W$ and $\Omega$ are the admissible sets of the parameters $w$ and $x$. The dependence is recovered from the observation data
$$ D = \{\, (x^i, y_i) \mid x^i \in R^n,\; y_i \in R^1;\; i = 1, \ldots, N \,\}. \quad (2) $$
Estimates of the parameters $w$ can be obtained, for example, by the least squares method
$$ w = \arg\min_{w \in W} E(w, D), \quad E(w, D) = \sum_{i=1}^N [y_i - f(x^i, w)]^2. \quad (3) $$
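As an illustration, here is a minimal Python sketch of the least squares estimation (3); the model callable `f`, the synthetic data, and the optimizer choice (`scipy.optimize.minimize` with BFGS) are our own placeholders, not part of the original method.

```python
import numpy as np
from scipy.optimize import minimize

def lsq_fit(f, X, y, w0):
    """Estimate w by minimizing E(w, D) = sum_i (y_i - f(x^i, w))^2, formula (3)."""
    def E(w):
        residuals = y - np.array([f(x, w) for x in X])
        return np.sum(residuals ** 2)
    return minimize(E, w0, method="BFGS").x

# Hypothetical usage with the linear model (4): f(x, w) = w_0 + sum_i w_i * x_i
f_lin = lambda x, w: w[0] + np.dot(w[1:], x)
X = np.random.rand(100, 3)                                    # synthetic predictors, n = 3
y = 1.0 + X @ np.array([0.5, -2.0, 3.0]) + 0.1 * np.random.randn(100)
w_hat = lsq_fit(f_lin, X, y, w0=np.zeros(4))                  # roughly recovers [1.0, 0.5, -2.0, 3.0]
```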
1.2. Linear model
In the linear-model problem it is required, from the data D, to build a model of the following form (i.e. to estimate its unknown parameters $w$):
$$ f(x, w) = w_0 + \sum_{i=1}^m w_i x_i, \quad (4) $$
where $x_{j(i)}$ are components of the vector $x \in R^n$; $w = (w_0, w_i,\; i = 1, \ldots, m)$ is the set of unknown parameters to be estimated by the least squares method (3); $m$ is the number of informative components of the vector $x \in R^n$ included in the model; and $n$ is the dimension of the vector $x$.
1.3. Logit models
The logit model has the form
$$ f(x, w) = \varphi(s), \quad (5) $$
where
$$ s(x, w) = w_0 + \sum_{i=1}^m w_i x_i, \quad (6) $$
and the activation function can be one of the following:
$$ \varphi(s) = \frac{s}{1+|s|}, \quad \varphi'(s) = \frac{1}{(1+|s|)^2}; \quad (7) $$
$$ \varphi(s) = \frac{1}{1+\exp(-s)}, \quad \varphi'(s) = \varphi(s)(1-\varphi(s)); \quad (8) $$
$$ \varphi(s) = s, \quad \varphi'(s) = 1. \quad (9) $$
The last function is linear; it is included so that the quality of the approximation obtained with it can be compared against (7)-(8).
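For concreteness, a minimal sketch of the logit model (5)-(6) with the three activation functions (7)-(9); the function names (`phi_abs`, `phi_exp`, `phi_lin`) are illustrative.

```python
import numpy as np

# Activation functions (7)-(9) and their derivatives
def phi_abs(s):  return s / (1.0 + np.abs(s))          # (7)
def dphi_abs(s): return 1.0 / (1.0 + np.abs(s)) ** 2

def phi_exp(s):  return 1.0 / (1.0 + np.exp(-s))       # (8), logistic sigmoid
def dphi_exp(s): return phi_exp(s) * (1.0 - phi_exp(s))

def phi_lin(s):  return s                              # (9), linear
def dphi_lin(s): return np.ones_like(s)

def logit_model(x, w, phi=phi_exp):
    """f(x, w) = phi(w_0 + sum_i w_i * x_i), formulas (5)-(6)."""
    s = w[0] + np.dot(w[1:], x)
    return phi(s)
```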
1.4. Two-layer sigmoidal neural network (with one hidden layer)
In the feed-forward approximation problem it is required, from the data $D$, to train a two-layer sigmoidal neural network (NN) of the following form (i.e. to estimate its unknown parameters $w$):
$$ f(x, w) = w_0^{(2)} + \sum_{i=1}^m w_i^{(2)} \varphi\Big( \sum_{j=1}^n x_j w_{ij}^{(1)} + w_{i0}^{(1)} \Big), \quad (10) $$
where $x_j$ are components of the vector $x \in R^n$; $w = \big( w_0^{(2)}, w_i^{(2)},\; i = 1, \ldots, m;\; w_{i0}^{(1)}, w_{ij}^{(1)},\; j = 1, \ldots, n,\; i = 1, \ldots, m \big)$ is the set of unknown parameters to be estimated by the least squares method (3); $\varphi(s)$ is the activation function of a neuron; $m$ is the number of neurons; and $n$ is the dimension of the vector $x$.
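A minimal sketch of the forward pass of the two-layer network (10), assuming the weights are stored as a hidden-layer matrix `W1`, hidden biases `b1`, output weights `w2`, and an output bias `b2` (the variable names and dimensions are illustrative):

```python
import numpy as np

def nn_forward(x, W1, b1, w2, b2, phi=lambda s: 1.0 / (1.0 + np.exp(-s))):
    """Two-layer sigmoidal network (10):
    f(x) = b2 + sum_i w2_i * phi(sum_j W1_ij * x_j + b1_i),
    with W1 of shape (m, n), b1 of shape (m,), w2 of shape (m,), b2 a scalar."""
    hidden = phi(W1 @ x + b1)       # outputs of the m hidden neurons
    return b2 + w2 @ hidden

# Hypothetical dimensions: n = 4 inputs, m = 3 hidden neurons
n, m = 4, 3
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(m, n)), rng.normal(size=m)
w2, b2 = rng.normal(size=m), 0.0
y_hat = nn_forward(rng.normal(size=n), W1, b1, w2, b2)
```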
1.5. Sigmoid neural network activation functions
Below are the sigmoid-type activation functions and their derivatives that we will use:
$$ \varphi(s) = \frac{s}{1+|s|}, \quad \varphi'(s) = \frac{1}{(1+|s|)^2}; \quad (11) $$
$$ \varphi(s) = \frac{1}{1+\exp(-s)}, \quad \varphi'(s) = \varphi(s)(1-\varphi(s)); \quad (12) $$
$$ \varphi(s) = s, \quad \varphi'(s) = 1. \quad (13) $$
1.6. Input Preprocessing
The main purpose of data preprocessing is to maximize the entropy of the input data. When all values of a variable are identical, the variable carries no information; conversely, if the values of a variable are uniformly distributed over a given interval, its entropy is maximal.
To transform the components of the variables and make their distributions more uniform, we use the logit-model formula
$$ x_i^+ = 0.5 - \frac{1}{1+\exp\!\big(-(x_i - \bar x_i)/\sigma_i\big)}, \qquad \sigma_i = \big[(x_i - \bar x_i)^2 / N\big]^{0.5}. \quad (14) $$
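A minimal sketch of the transform (14), applied column-wise to a data matrix; we read $\sigma_i$ as the sample standard deviation of the i-th component, which is our interpretation of the formula:

```python
import numpy as np

def preprocess(X):
    """Transform each column by formula (14):
    x+ = 0.5 - 1 / (1 + exp(-(x - mean) / sigma)); values fall in (-0.5, 0.5)."""
    mean = X.mean(axis=0)
    sigma = np.sqrt(((X - mean) ** 2).mean(axis=0))   # sample std, assumed meaning of sigma_i
    sigma = np.where(sigma > 0, sigma, 1.0)           # guard against constant columns
    return 0.5 - 1.0 / (1.0 + np.exp(-(X - mean) / sigma))
```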
1.7. Suppression of redundant variables and smoothing
To suppress redundant variables, a preliminary training stage is carried out by minimizing over $w$ the sum of the quadratic error and a nonsmooth smoothing functional
$$ E_\Omega(\alpha, w, D) = \sum_{x, y \in D} \big(y - f(x, w)\big)^2 + \alpha\, \Omega(w), \quad (15) $$
$$ \Omega(w) = \sum_{i \in I_w} |w_i|^\gamma, \quad 0 < \gamma < 1, \quad (16) $$
where $\alpha$ is the regularization parameter and $I_w$ is the set of indices of the components of $w$ to which regularization is applied. The functional $\Omega(w)$ is designed to suppress redundant variables of the model $f(x, w)$. As a result, the solution contains a set of components close to zero, which must then be eliminated by special algorithms.
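A minimal sketch of the regularized objective (15)-(16); the model callable `f` and the index set `idx` (playing the role of $I_w$) are illustrative placeholders:

```python
import numpy as np

def regularized_error(w, X, y, f, alpha, gamma=0.5, idx=None):
    """E_Omega(alpha, w, D) = sum (y - f(x, w))^2 + alpha * sum_{i in Iw} |w_i|^gamma, (15)-(16)."""
    idx = np.arange(len(w)) if idx is None else idx   # Iw: indices being regularized
    residual = np.sum((y - np.array([f(x, w) for x in X])) ** 2)
    penalty = np.sum(np.abs(w[idx]) ** gamma)
    return residual + alpha * penalty
```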
2. Regularizing functionals for smoothing and for suppressing redundant variables
2.1. Nonsmooth regularization
The derivatives of the functional (17),
$$ \Omega(w) = \sum_{i=1}^n |w_i|^\gamma, \quad 0 < \gamma < 1, \quad (17) $$
have the form
$$ \frac{\partial \Omega(w)}{\partial w_i} = \frac{\gamma\, \mathrm{sign}(w_i)}{|w_i|^{1-\gamma}}, \quad i = 1, \ldots, n, \quad 0 < \gamma < 1. \quad (18) $$
As $w_i \to 0$, these derivatives become arbitrarily large. This means that the corners of the star-shaped level surfaces degenerate into needles, which slows the convergence of minimization methods and leads to premature emergency stops.
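A small numerical illustration of this blow-up of the derivative (18) for $\gamma = 0.5$:

```python
import numpy as np

gamma = 0.5
for w in [1.0, 1e-2, 1e-4, 1e-8]:
    grad = gamma * np.sign(w) / np.abs(w) ** (1.0 - gamma)
    print(f"w = {w:g}:  dOmega/dw = {grad:g}")
# The derivative grows without bound as w -> 0 (about 5e3 already at w = 1e-8).
```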
The level lines of the functional (17) (star-shaped level lines) are shown in Figure 1.

Figure 1 shows the interaction of the two functionals (the main one and the regularizing one), the directions of their gradients, and the resulting gradient.
2.2. A special case of nonsmooth regularization (Occam's razor)
Consider (17) with $\gamma = 1$:
$$ \Omega(w) = \sum_{i=1}^n |w_i|. \quad (19) $$
The derivatives of (19) have the form
$$ \frac{\partial \Omega(w)}{\partial w_i} = \mathrm{sign}(w_i), \quad i = 1, \ldots, n. \quad (20) $$
The level surfaces are squares located symmetrically about the origin and rotated by 45 degrees. The function (19) is not smooth.
2.3. Smooth regularization with bounded derivatives
In the following functional we get rid of the corners that degenerate into needles:
$$ \Omega(w) = \sum_{i=1}^n (|w_i| + \varepsilon)^\gamma, \quad \varepsilon > 0, \quad 0 < \gamma < 1, \quad (21) $$
$$ \frac{\partial \Omega(w)}{\partial w_i} = \frac{\gamma\, \mathrm{sign}(w_i)}{(|w_i| + \varepsilon)^{1-\gamma}}, \quad i = 1, \ldots, n, \quad 0 < \gamma < 1. \quad (22) $$
The disadvantage of (21) is its nonuniform sensitivity to the parameter $\varepsilon$ when the orders of magnitude of the estimated parameters $w$ differ across neural networks.
2.4. Smooth homogeneous regularization with bounded derivatives
In the next functional we remove this nonuniformity in the parameters $w$:
$$ \Omega(w) = \sum_{i=1}^n \Big(|w_i| + \varepsilon \sum_{j=1}^n |w_j|\Big)^{\!\gamma} = \sum_{i=1}^n v_i^\gamma = \sum_{i=1}^n f_i, \quad \varepsilon > 0, \; \varepsilon \approx 10^{-4}, \; 0 < \gamma < 1, \quad (23) $$
$$ \frac{\partial \Omega(w)}{\partial w_k} = \frac{\gamma\, \mathrm{sign}(w_k)}{\big(|w_k| + \varepsilon \sum_{j=1}^n |w_j|\big)^{1-\gamma}} + \sum_{i=1}^n \frac{\varepsilon\, \gamma\, \mathrm{sign}(w_k)}{\big(|w_i| + \varepsilon \sum_{j=1}^n |w_j|\big)^{1-\gamma}} = \\ = \gamma\, \mathrm{sign}(w_k) \left[ \frac{f_k}{v_k} + \varepsilon \sum_{i=1}^n \frac{f_i}{v_i} \right] = \gamma\, \mathrm{sign}(w_k) \left[ \frac{1}{v_k^{1-\gamma}} + \varepsilon \sum_{i=1}^n \frac{1}{v_i^{1-\gamma}} \right], \quad k = 1, \ldots, n. \quad (24) $$
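A minimal sketch of the homogeneous regularizer (23) and its gradient (24), together with a finite-difference check of the gradient away from zero (parameter values are illustrative):

```python
import numpy as np

def omega_gamma(w, gamma=0.5, eps=1e-4):
    """Homogeneous nonsmooth regularizer (23): sum_i (|w_i| + eps * sum_j |w_j|)^gamma."""
    v = np.abs(w) + eps * np.sum(np.abs(w))
    return np.sum(v ** gamma)

def omega_gamma_grad(w, gamma=0.5, eps=1e-4):
    """Gradient (24): gamma * sign(w_k) * [1 / v_k^(1-gamma) + eps * sum_i 1 / v_i^(1-gamma)]."""
    v = np.abs(w) + eps * np.sum(np.abs(w))
    inv = 1.0 / v ** (1.0 - gamma)
    return gamma * np.sign(w) * (inv + eps * np.sum(inv))

# Finite-difference check of the gradient at a point away from w = 0
w = np.array([0.3, -1.2, 0.05])
h, k = 1e-7, 1
num = (omega_gamma(w + h * np.eye(3)[k]) - omega_gamma(w - h * np.eye(3)[k])) / (2 * h)
assert np.isclose(num, omega_gamma_grad(w)[k], rtol=1e-4)
```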
Let us transform (23):
$$ \Omega(w) = \sum_{i=1}^n \Big(|w_i| + \varepsilon \sum_{j=1}^n |w_j|\Big)^{\!\gamma} = \Big(\sum_{i=1}^n |w_i|\Big)^{\!\gamma} \sum_{i=1}^n \left( \frac{|w_i|}{\sum_{j=1}^n |w_j|} + \varepsilon \right)^{\!\gamma}, \quad 0 < \gamma < 1. \quad (25) $$
We introduce the normalized variables
$$ z_i = \frac{|w_i|}{\sum_{j=1}^n |w_j|}, \quad z_i \in [0, 1], \quad \sum_{i=1}^n z_i = 1. \quad (26) $$
Then (25) takes the form
$$ \Omega(w) = \Big(\sum_{i=1}^n |w_i|\Big)^{\!\gamma} \sum_{i=1}^n \left( \frac{|w_i|}{\sum_{j=1}^n |w_j|} + \varepsilon \right)^{\!\gamma} = \Big(\sum_{i=1}^n |w_i|\Big)^{\!\gamma} \sum_{i=1}^n (z_i + \varepsilon)^\gamma, \quad 0 < \gamma < 1. \quad (27) $$
Let us describe the structure of this functional. The first factor, $\left(\sum_{i=1}^n |w_i|\right)^\gamma$, is a homogeneous function of degree $\gamma$ and describes the overall growth of the functional. The second factor in (27) is a homogeneous function of degree zero and determines the behavior of the functional depending on the proportions between the variables.
Let us list the properties of the functional (23) that determine its effectiveness:
- The level surfaces form shapes similar to one another with respect to the origin, which means the regularization is independent of the scale of the variables (see the numerical check after this list).
- The overall-growth factor is a concave function, which gives rise to extrema on the coordinate axes and therefore makes it possible to remove variables.
- The degree of concavity is set by the parameter $\gamma$, which can be chosen optimally in a preliminary computational experiment and is then kept fixed for calculations on networks of this type.
- The structure of the corner points is determined by the parameter $\varepsilon$, which can also be chosen optimally in a preliminary computational experiment and then kept fixed for calculations on networks of this type.
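A quick numerical check of the scale-independence property from the first item: the functional (23) is homogeneous of degree $\gamma$, so rescaling all weights by a factor $c > 0$ multiplies it by $c^\gamma$ and does not change the proportions between variables. The helper function is our own sketch:

```python
import numpy as np

def omega_gamma(w, gamma=0.5, eps=1e-4):
    v = np.abs(w) + eps * np.sum(np.abs(w))
    return np.sum(v ** gamma)

w = np.array([0.2, -3.0, 0.7])
c = 100.0
# Homogeneity of degree gamma: Omega(c * w) == c**gamma * Omega(w)
assert np.isclose(omega_gamma(c * w), c ** 0.5 * omega_gamma(w))
```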
The functionals considered below will be evaluated against these properties, which are necessary for eliminating redundant variables.
2.5. Quadratic regularization (Tikhonov regularization)
The derivatives of the quadratic functional
$$ \Omega(w) = \sum_{i=1}^n w_i^2 \quad (28) $$
have the form
$$ \frac{\partial \Omega(w)}{\partial w_i} = 2 w_i, \quad i = 1, \ldots, n. \quad (29) $$
This functional does not make it possible to eliminate redundant variables, since it does not possess the second property listed above (concavity of the overall-growth factor).
3. Results of the numerical study
Logit models and sigmoidal neural networks with nonsmooth homogeneous regularization and with Tikhonov's quadratic regularization were studied on test and real data.
3.1. Study of various models on real data
The various dependences $f(x, w)$ were recovered from the observation data
$$ D = \{\, (x^i, y_i) \mid x^i \in R^n,\; y_i \in \{0, 1\};\; i = 1, \ldots, N \,\}, \quad (1) $$
where the quantities $y_i$ indicate default ($y_i = 1$) or the absence of default ($y_i = 0$). Estimates of the unknown model parameters $w$ were obtained by the least squares method
$$ w = \arg\min_{w \in W} E(w, D), \quad E(w, D) = \sum_{i=1}^N [y_i - f(x^i, w)]^2. \quad (2) $$
The input data were preprocessed. The main purpose of preprocessing is to maximize the entropy of the input data: when all values of a variable are identical, the variable carries no information, whereas if its values are uniformly distributed over a given interval, its entropy is maximal.
To transform the components of the variables and make their distributions more uniform, we use the logit-model formula
$$ x_i^+ = 0.5 - \frac{1}{1+\exp\!\big(-(x_i - \bar x_i)/\sigma_i\big)}, \qquad \sigma_i = \big[(x_i - \bar x_i)^2 / N\big]^{0.5}. \quad (3) $$
The quality of the models was evaluated using the AUC, the area under the ROC curve.
The ROC curve (error curve) is a graphical characteristic of the quality of a binary classifier: the dependence of the true-positive rate on the false-positive rate as the threshold of the decision rule is varied.
The advantage of the ROC curve is its invariance with respect to the relative cost of type I and type II errors.
The area under the ROC curve (AUC, Area Under Curve) is an aggregate characteristic of classification quality that does not depend on the cost ratio of the errors. The higher the AUC, the better the classification model; this indicator is often used for comparative analysis of several classification models.
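A minimal sketch of computing the AUC from model scores; `roc_auc_score` from scikit-learn is one standard option (the library choice and the toy data are ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# y_true: default indicators (0/1), y_score: model outputs f(x, w)
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.65, 0.3, 0.9])
print("AUC =", roc_auc_score(y_true, y_score))   # 1.0 is perfect ranking, 0.5 is random
```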
3.2. Study of logit models with different types of regularization
The logit model
$$ f(x, w) = \varphi(s), \qquad s(x, w) = w_0 + \sum_{i=1}^m w_i x_i, \quad (4) $$
was used with three kinds of activation function,
$$ \varphi(s) = s, \quad \varphi'(s) = 1; \quad (5) $$
$$ \varphi(s) = \frac{s}{1+|s|}, \quad \varphi'(s) = \frac{1}{(1+|s|)^2}; \quad (6) $$
$$ \varphi(s) = \frac{1}{1+\exp(-s)}, \quad \varphi'(s) = \varphi(s)(1-\varphi(s)), \quad (7) $$
which we denote LIN, ABS and EXP, respectively. The model coefficients were found by minimizing the functional
$$ E_\Omega(\alpha, w, D) = \sum_{x, y \in D} \big(y - f(x, w)\big)^2 + \alpha\, \Omega(w). \quad (8) $$
As $\Omega(w)$ we used the quadratic Tikhonov regularization functional
$$ \Omega_2(w) = \sum_{i=1}^n w_i^2 \quad (9) $$
and the nonsmooth homogeneous regularization functional
$$ \Omega_\gamma(w) = \sum_{i=1}^n \Big(|w_i| + \varepsilon \sum_{j=1}^n |w_j|\Big)^{\!\gamma}, \quad \varepsilon > 0, \; \varepsilon \approx 10^{-6}, \; 0 < \gamma < 1. \quad (10) $$
The regularization algorithm consisted of two stages. An initial value $\alpha_0$ was chosen, and on subsequent iterations it was doubled: $\alpha_{k+1} = 2\alpha_k$. For each such value the model was computed and the variables with excessively small coefficients were removed. At each iteration a model was also computed with some small value $\alpha_{\min}$. This scheme combines smoothing and removal of variables at large regularization parameters $\alpha_k$ with unconstrained model building at small values. Models with small regularization parameters are useful under the assumption that the variables remaining after removal are significant for building the model.
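An illustrative sketch of this two-stage scheme: $\alpha$ is doubled at each iteration, coefficients that become excessively small are removed, and the reduced model is refitted with a small $\alpha_{\min}$. The fitting routine `fit_model` (assumed to return one coefficient per remaining variable), the threshold, and the iteration count are hypothetical placeholders:

```python
import numpy as np

def prune_with_doubling(fit_model, X, y, alpha0=1e-3, alpha_min=1e-6,
                        n_iter=10, threshold=1e-3):
    """Two-stage regularization: double alpha, drop small coefficients, refit with alpha_min."""
    active = np.arange(X.shape[1])                     # indices of variables still in the model
    alpha = alpha0
    for _ in range(n_iter):
        w = fit_model(X[:, active], y, alpha)          # fit with the current (large) alpha
        keep = np.abs(w) > threshold                   # remove excessively small coefficients
        active = active[keep]
        fit_model(X[:, active], y, alpha_min)          # also fit a lightly regularized model
        alpha *= 2.0                                   # alpha_{k+1} = 2 * alpha_k
    return active, fit_model(X[:, active], y, alpha_min)
```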
The following table shows the results of the model calculations; the number of variables is nx = 254.

AUC_O - AUC on the training set
AUC_T - AUC on the test set
3.3. Conclusions of the study of logit models on real data
The best variants of the models with quadratic regularization were obtained with a scenario in which part of the model coefficients is first removed using large regularization parameters, and the model parameters are then recomputed with small regularization coefficients. Such scenarios require large regularization parameters, which can lead to the removal of significant components of the model.
The optimal model with nonsmooth optimization was obtained at small values of the regularization parameters, which allows us to conclude that weak variables are removed and the remaining variables are smoothed simultaneously.
A comparison of the average AUC_O and AUC_T of the models indicates that more efficient models are obtained on the basis of nonsmooth optimization.
Average results for logit models

3.4. Study of neural network models with different types of regularization
Two-layer sigmoidal neural networks (with one hidden layer) were built. In the feed-forward approximation problem it is required, from the data D, to train the following two-layer sigmoidal neural network (NN) (i.e. to estimate its unknown parameters $w$):
$$ f(x, w) = w_0^{(2)} + \sum_{i=1}^m w_i^{(2)} \varphi\Big( \sum_{j=1}^n x_j w_{ij}^{(1)} + w_{i0}^{(1)} \Big), \quad (11) $$
where $x_j$ are components of the vector $x \in R^n$; $w = \big( w_0^{(2)}, w_i^{(2)},\; i = 1, \ldots, m;\; w_{i0}^{(1)}, w_{ij}^{(1)},\; j = 1, \ldots, n,\; i = 1, \ldots, m \big)$ is the set of unknown parameters to be estimated by the least squares method (3); $\varphi(s)$ is the activation function of a neuron; $m$ is the number of neurons; and $n$ is the dimension of the vector $x$.
The neural network model was used with two types of activation function,
$$ \varphi(s) = \frac{s}{1+|s|}, \quad \varphi'(s) = \frac{1}{(1+|s|)^2}; \quad (12) $$
$$ \varphi(s) = \frac{1}{1+\exp(-s)}, \quad \varphi'(s) = \varphi(s)(1-\varphi(s)), \quad (13) $$
which we denote ABS and EXP, respectively. The model coefficients were found by minimizing the functional
$$ E_\Omega(\alpha, w, D) = \sum_{x, y \in D} \big(y - f(x, w)\big)^2 + \alpha\, \Omega(w). \quad (14) $$
As $\Omega(w)$ we used the quadratic Tikhonov regularization functional
$$ \Omega_2(w) = \sum_{i=1}^n w_i^2 \quad (15) $$
and the nonsmooth homogeneous regularization functional
$$ \Omega_\gamma(w) = \sum_{i=1}^n \Big(|w_i| + \varepsilon \sum_{j=1}^n |w_j|\Big)^{\!\gamma}, \quad \varepsilon > 0, \; \varepsilon \approx 10^{-6}, \; 0 < \gamma < 1. \quad (16) $$
The regularization algorithm consisted of two stages, as in Section 3.2. An initial value $\alpha_0$ was chosen, and on subsequent iterations it was doubled: $\alpha_{k+1} = 2\alpha_k$. For each such value the model was computed and the variables with excessively small coefficients were removed. At each iteration a model was also computed with some small value $\alpha_{\min}$. This scheme combines smoothing and removal of variables at large regularization parameters $\alpha_k$ with unconstrained model building at small values. Models with small regularization parameters are useful under the assumption that the variables remaining after removal are significant for building the model.
Results for the neural network without fixing the centers. The following tables show the results of the model calculations; the number of variables is $nx$ = 254.

Results for the neural network with fixing the centers.

AUC_O - AUC on the training set
AUC_T - AUC on the test set
3.5. Conclusions of the study of neural network models on real data
The best variants of the models with quadratic regularization were obtained with a scenario in which part of the model coefficients is first removed using large regularization parameters, and the model parameters are then recomputed with small regularization coefficients. Such scenarios require large regularization parameters, which can lead to the removal of significant components of the model.
The optimal model with nonsmooth optimization was obtained at small values of the regularization parameters, which allows us to conclude that weak variables are removed and the remaining variables are smoothed simultaneously.
A comparison of the average AUC_O and AUC_T of the models indicates that more efficient models are obtained on the basis of nonsmooth optimization.
The second conclusion is that preliminary fixation of the working regions of the neurons has a positive effect, yielding a more efficient neural network model. Fixing the neurons at the first stage prevents their working regions from leaving the data region, thereby keeping all the neurons active.
Average results for neural networks without fixing centers.

Average results for neural networks with fixing centers.

AUC_O - AUC on the training set
AUC_T - AUC on the test set
As our practice has shown, sigmoidal neural networks with one hidden layer can be successfully used to predict borrower default and give better results than models based on logistic regression. The drawbacks of neural-network models are successfully overcome by preliminary regularization of the input data and by training the model with fixation of the working regions of the neurons.