How to identify a smoker from a cardiogram using artificial neural networks (and why this is needed)

Habr has already covered the scientific competition for mathematicians and developers launched by the creators of the CardioQVARK mobile cardiograph. In short, the goal of the competition is to create an algorithm that can distinguish smokers from non-smokers based on their cardiograms.

One of the competition leaders was Roman Isakov, Ph.D., Associate Professor at the Department of Biomedical and Electronic Means and Technologies of the Institute of Innovative Technologies, Vladimir State University named after A. G. and N. G. Stoletov. He developed a method for identifying a smoker based on RR intervalograms and artificial neural networks, and that method is the subject of this article.

Why look for a smoker

Machine learning researchers have published studies showing that the ECG signal carries information about the functioning of all body systems, not just the heart. In addition, each disease “modulates” the ECG signal in its own way, so the signs of the increments of the intervals and amplitudes of successive cardiocycles can be used to detect possible health problems, including at the early stages of their occurrence.

In a report at the V International Conference "Mathematical Biology and Bioinformatics", Konstantin Vorontsov of the A. A. Dorodnitsyn Computing Centre of the RAS showed differences in the signs of the increments of the intervals (dRn), amplitudes (dTn) and angles (dαn) of cardiocycles between healthy people and people suffering from various diseases.

Detecting smokers from cardiograms serves the main goal of the competition: to obtain a result showing whether or not high-quality diagnostics is possible using an ECG together with algorithms that detect markers of diseases of various organs in the cardiogram signal.

The essence of the proposed method

The solution was based on the hypothesis that heart rate variability (HRV) depends on the functional state of the body [R. M. Baevsky et al.]. This model includes feedback from the brain through the peripheral nervous system, which makes it possible to control blood flow, including by dynamically adjusting the heart rate.

For this reason, the RR intervalogram was chosen as the main signal for analysis. This signal contains all the information about the processes that control the heart rhythm, in their final manifestation.

Extracting information about the effects of nicotine and other substances on the human body comes down to finding the HRV parameters that best separate the classes of smokers and non-smokers. Since the relationships between the parameters may be non-linear, the classifier was built on artificial neural networks.

The training sample of cardiograms for the competition included 100 records of smokers and non-smokers in a 50/50 ratio. A control sample of 250 cardiograms was also provided, but it came without annotations, so it could not be used for research.

Therefore, the researcher had to split the training set into two equal subsamples: training and test.

Records were assigned to the test and training subsamples at random, subject to the condition that the ratio of smokers to non-smokers be the same in each. Because the training subsample turned out to be too small, at the final stage, after the best model had been chosen, the model was additionally trained on the records of the test subsample.
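A random split that keeps the class ratio equal in both halves can be sketched as follows. This is only an illustration under assumptions: the function name, the fixed seed and the synthetic labels are hypothetical and are not taken from the competition solution.

```python
import random

def stratified_halves(labels, seed=0):
    """Split record indices into two halves with the same class ratio in each."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        half = len(idx) // 2
        train += idx[:half]
        test += idx[half:]
    return sorted(train), sorted(test)

# Synthetic labels: 1 = smoker, 0 = non-smoker, 50 records of each class.
labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_halves(labels)
```

Each half here contains 50 records, 25 from each class.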

Not so simple

To minimize overfitting, a local validation set (20%) was randomly allocated from the data used for training. It did not take part in adjusting the model parameters and served only to monitor the model error. When the error on the validation set began to increase, training was stopped.
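The stopping rule can be illustrated with a small helper that scans a sequence of per-epoch validation errors. The `patience` parameter (how many non-improving epochs to tolerate) and the error values are assumptions made for the sketch; the source only states that training stopped once the validation error grew.

```python
def stop_epoch(val_errors, patience=1):
    """Return (epoch at which training stops, epoch with the lowest validation error)."""
    best_err = float("inf")
    best_epoch = 0
    bad_epochs = 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return len(val_errors) - 1, best_epoch

# Hypothetical validation errors: they fall, then start to rise.
stopped_at, best_at = stop_epoch([0.50, 0.40, 0.35, 0.36, 0.37], patience=2)
```

In practice the model weights saved at `best_at` would be the ones kept.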

It is likely that the “do not smoke” class includes people who concealed the fact that they smoke, as well as passive smokers, while the “smoke” class includes people who have been smoking only for a short time. Therefore, in one of the experiments the training database was modified based on a neural network analysis of the sample using the best of the obtained models: the records whose labels disagreed most with the model were modified. This approach gave a slight increase in efficiency on an independent (validation) sample. However, that sample presumably also contains erroneous labels, which is a limiting factor.

Data processing and analysis

To form the feature space for the smoker recognition model, the researcher studied various well-known statistical parameters, special parameters for assessing heart rate variability, as well as the spectrum and histogram of the heart rhythm.

The parameters were divided into the following groups:

Entropy;
Time-domain parameters;
Frequency-domain parameters;
Histogram shape parameters.

The study consisted of calculating the entire set of parameters for the classes of smokers and non-smokers over the records of the training database and then jointly analyzing their distributions. Only those parameters were selected whose distribution densities differed significantly between the classes in some region.

Additionally, the heart rhythm spectra were studied, and the frequency ranges in which the two classes were best separated were selected. Then a cross-correlation analysis of the selected parameters was performed to exclude strong linear relationships in the feature space.
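One simple way to exclude strongly linearly related features is a greedy filter on the Pearson correlation coefficient. The sketch below is an assumption about how such a filter could look, not the author's procedure: the 0.9 threshold, the greedy order and the toy feature values are all hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5  # assumes neither feature is constant

def drop_correlated(features, threshold=0.9):
    """Keep a feature only if |r| with every already kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Toy data: "b" is almost a multiple of "a", "c" is only weakly related to "a".
features = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],
    "c": [1.0, -1.0, 1.0, -1.0],
}
kept = drop_correlated(features)
```

The result depends on the order in which features are visited, so in a real study the more informative parameter of each correlated pair would be placed first.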

In the description of the competition solution, the researcher notes that parallel experiments were also carried out on a parameter set without the correlation-based optimization and on a set that used samples of the heart rhythm spectrum. Those results are not given in the solution because they were not the best.

As a result, the following set of parameters was obtained:

1) EnLog - Entropy of "logarithmic energy" (Log Energy Entropy);
2) EnTrs - Threshold Entropy;
3,4) EnSamp - Two sample entropies (Sample Entropy) with parameter 1 and 5;
5) NN22 - The number of pairs of successive RR intervals differing by more than 22 ms;
6) HRVTi - Triangular index of the heart rate histogram;
7) LF / HF - The ratio of the power of the low-frequency to the high-frequency part of the spectrum (the standard parameter for estimating HRV);
8) LFn - The ratio of the power of the low-frequency part of the spectrum to the sum of the powers of the low-frequency and high-frequency parts of the spectrum;
9) SBxn (4) - The ratio of the power of the spectrum in the range from 0.093 Hz to 0.125 Hz to the total power of the spectrum (TP). This parameter was obtained as a result of a special spectral analysis;
10) SB1n - Spectrum power in the range from 0.0039 Hz to 0.0391 Hz. This parameter was obtained as a result of a special spectral analysis.
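Some of the listed parameters are easy to illustrate. The sketch below computes NN22 from RR intervals in milliseconds, and LF/HF and LFn from a given power spectrum. The band limits (LF 0.04-0.15 Hz, HF 0.15-0.40 Hz) are the conventional HRV bands and are an assumption here, since the source does not state which limits were used; the function names and the sample data are hypothetical.

```python
def nn_count(rr_ms, threshold_ms=22.0):
    """Number of pairs of successive RR intervals differing by more than threshold_ms."""
    return sum(1 for a, b in zip(rr_ms, rr_ms[1:]) if abs(b - a) > threshold_ms)

def band_power(freqs, power, lo, hi):
    """Sum of spectral power over bins with lo <= f < hi."""
    return sum(p for f, p in zip(freqs, power) if lo <= f < hi)

def lf_hf_features(freqs, power, lf=(0.04, 0.15), hf=(0.15, 0.40)):
    """LF/HF and LFn for assumed (conventional) LF and HF band limits."""
    p_lf = band_power(freqs, power, *lf)
    p_hf = band_power(freqs, power, *hf)
    return {"LF/HF": p_lf / p_hf, "LFn": p_lf / (p_lf + p_hf)}

# Synthetic inputs for illustration only.
rr_ms = [800, 810, 850, 845, 900, 880]
nn22 = nn_count(rr_ms)
feats = lf_hf_features([0.05, 0.10, 0.20, 0.30], [2.0, 1.0, 1.0, 0.5])
```

The same `band_power` helper, divided by the total power, would give band ratios of the SBxn kind for the ranges quoted in the list.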

The data processing algorithm can be described step by step as follows:

In the first step, the cardiointervalogram (CIG) is loaded. Outliers are then detected using a cut-off at the level of 1 standard deviation and eliminated using median interpolation. Finally, spline interpolation of the CIG is performed to obtain a uniformly sampled rhythmogram (RG) signal.
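The preprocessing step can be sketched roughly as follows, reading the cut-off as one standard deviation from the mean. Several details are assumptions: the neighbourhood used for the median replacement, the 4 Hz resampling rate, and above all the use of linear interpolation, which stands in for the spline here only to keep the sketch short and dependency-free.

```python
from statistics import mean, median, pstdev

def replace_outliers(rr, n_sd=1.0, half_window=2):
    """Flag intervals further than n_sd standard deviations from the mean
    and replace each with the median of its non-flagged neighbours."""
    mu, sd = mean(rr), pstdev(rr)
    flagged = [abs(v - mu) > n_sd * sd for v in rr]
    cleaned = list(rr)
    for i, bad in enumerate(flagged):
        if bad:
            lo, hi = max(0, i - half_window), min(len(rr), i + half_window + 1)
            neighbours = [rr[j] for j in range(lo, hi) if not flagged[j]]
            cleaned[i] = median(neighbours) if neighbours else median(rr)
    return cleaned

def resample_uniform(rr, fs=4.0):
    """Resample RR(t) onto a uniform grid of fs samples per second.
    Linear interpolation is used here in place of the spline."""
    t, acc = [], 0.0
    for v in rr:
        acc += v
        t.append(acc)  # beat times in seconds
    n = int((t[-1] - t[0]) * fs) + 1
    out, k = [], 0
    for i in range(n):
        x = t[0] + i / fs
        while k < len(t) - 2 and t[k + 1] < x:
            k += 1  # move to the segment [t[k], t[k + 1]] that contains x
        w = (x - t[k]) / (t[k + 1] - t[k])
        out.append(rr[k] + w * (rr[k + 1] - rr[k]))
    return out

rr = [0.80, 0.82, 1.60, 0.81, 0.79, 0.80]  # synthetic RR intervals, seconds
cleaned = replace_outliers(rr)
rhythmogram = resample_uniform(cleaned)
```

In this example only the 1.60 s interval is flagged, and it is replaced by the median of its four neighbours.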

To remove the constant component, the mean value was subtracted from the rhythmogram, after which a Tukey window was applied to suppress the Gibbs effect. A fast Fourier transform of the processed rhythmogram was then computed, and the heart rhythm spectrum was obtained as the absolute value of the complex transform coefficients.
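The spectrum calculation can be sketched in a few lines: subtract the mean, apply a tapered cosine (Tukey) window, and take the magnitude of the discrete Fourier transform. A naive O(n²) DFT is used below instead of an FFT so that the example needs no external libraries; it gives the same coefficients. The window parameter `alpha=0.5`, the 4 Hz sampling rate and the test signal are assumptions.

```python
import cmath
import math
from statistics import mean

def tukey(n, alpha=0.5):
    """Tukey (tapered cosine) window of length n, assuming 0 <= alpha <= 1."""
    if n == 1 or alpha <= 0:
        return [1.0] * n
    edge = alpha * (n - 1) / 2.0  # length of each cosine taper, in samples
    window = []
    for i in range(n):
        d = min(i, n - 1 - i)  # distance to the nearest end of the window
        window.append(1.0 if d >= edge else 0.5 * (1 - math.cos(math.pi * d / edge)))
    return window

def amplitude_spectrum(signal, fs, alpha=0.5):
    """Return (frequencies, |DFT|) of the mean-removed, windowed signal."""
    n = len(signal)
    mu = mean(signal)
    x = [(v - mu) * w for v, w in zip(signal, tukey(n, alpha))]
    freqs, amps = [], []
    for k in range(n // 2 + 1):
        coeff = sum(xv * cmath.exp(-2j * math.pi * k * i / n) for i, xv in enumerate(x))
        freqs.append(k * fs / n)
        amps.append(abs(coeff))
    return freqs, amps

# Synthetic rhythmogram: 0.8 s mean RR with a 0.25 Hz oscillation, sampled at 4 Hz.
fs = 4.0
signal = [0.8 + 0.05 * math.sin(2 * math.pi * 0.25 * i / fs) for i in range(64)]
freqs, amps = amplitude_spectrum(signal, fs)
peak_freq = freqs[max(range(len(amps)), key=amps.__getitem__)]
```

For the synthetic signal the largest spectral component falls at the 0.25 Hz bin, as expected.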

The parameters listed above were calculated from the CIG (except for the spectral parameters) and then normalized to a range from 0 to 1.
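Scaling to the 0 to 1 range is commonly done with min-max normalization. The source does not say which scheme was used, so the sketch below is an assumption, with hypothetical function names and toy values; the bounds would normally be fitted on the training records and reused for new ones.

```python
def fit_minmax(rows):
    """Return a (min, max) pair for every feature column of the given rows."""
    return [(min(col), max(col)) for col in zip(*rows)]

def apply_minmax(row, bounds):
    """Scale one feature vector to [0, 1] using previously fitted bounds."""
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, (lo, hi) in zip(row, bounds)]

# Toy feature matrix: three records, two features.
rows = [[2.0, 10.0], [4.0, 30.0], [6.0, 20.0]]
bounds = fit_minmax(rows)
scaled = [apply_minmax(r, bounds) for r in rows]
```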

The model was obtained as follows:

First, perceptron neural networks (NNs) were trained with a successively increasing number of neurons in the hidden layers (according to the previously described method). The result was a set of neural network models of different sizes, from which the optimal network size could be selected.

Next, the set of NNs was evaluated on the test subsample, and the best ones were selected by AUC.
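Selecting models by AUC can be illustrated with the rank-based definition of AUC: the probability that a randomly chosen smoker receives a higher score than a randomly chosen non-smoker. The labels, the scores and the model names below are made up for the sketch.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical outputs of two candidate networks on the same test records.
labels = [1, 1, 0, 0, 1, 0]
model_scores = {
    "A": [0.9, 0.8, 0.3, 0.4, 0.35, 0.1],
    "B": [0.6, 0.4, 0.5, 0.7, 0.3, 0.2],
}
aucs = {name: auc(labels, s) for name, s in model_scores.items()}
best_model = max(aucs, key=aucs.get)
```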

The third step was to adjust the cut-off threshold of the selected models using ROC analysis, balancing Sensitivity and Specificity so that the difference between them was minimal. Sensitivity or Specificity values below 50% were discarded.
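The threshold adjustment can be sketched as a scan over candidate cut-offs that keeps the one with the smallest gap between Sensitivity and Specificity, skipping candidates where either value is below 50%. Taking the observed scores as the candidate thresholds and treating a score equal to the threshold as "smoker" are assumptions of this sketch, as is the toy data.

```python
def sens_spec(labels, scores, threshold):
    """Sensitivity and Specificity when score >= threshold is classed as positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg

def balanced_threshold(labels, scores):
    """Return (threshold, sensitivity, specificity) with the smallest |Se - Sp|,
    or None if every candidate has Se or Sp below 0.5."""
    best = None
    for thr in sorted(set(scores)):
        se, sp = sens_spec(labels, scores, thr)
        if se < 0.5 or sp < 0.5:
            continue  # discard operating points below 50%
        if best is None or abs(se - sp) < abs(best[1] - best[2]):
            best = (thr, se, sp)
    return best

# Toy scores: 1 = smoker, 0 = non-smoker.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
result = balanced_threshold(labels, scores)
```

For this toy data the balanced operating point is a threshold of 0.6, where both values equal 75%.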

Using this method, the following NN structures were investigated:

two-layer, with one hidden sigmoid layer and a sigmoid output (SS);
three-layer, with two hidden tapering sigmoid layers and a sigmoid output (SSdS);
three-layer, with two hidden tapering sigmoid layers and a linear output (SSdP).
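The three structures differ only in the number of hidden layers and in the output activation, which a small forward-pass sketch makes concrete. The layer sizes (10 inputs, hidden layers of 8 and 4 neurons, and so on) and the random weights are purely hypothetical; the source does not give the sizes that were finally chosen, and no training is shown here.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear(z):
    return z

def make_layer(n_in, n_out, rng):
    """Random weights for one dense layer; the last entry of each row is the bias."""
    return [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_out)]

def forward(x, layers):
    """Propagate a feature vector through a list of (weights, activation) layers."""
    for weights, activation in layers:
        x = [activation(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in weights]
    return x

rng = random.Random(0)
n_features = 10  # ten input parameters, as in the feature list
structures = {
    "SS": [(make_layer(n_features, 6, rng), sigmoid), (make_layer(6, 1, rng), sigmoid)],
    "SSdS": [(make_layer(n_features, 8, rng), sigmoid), (make_layer(8, 4, rng), sigmoid),
             (make_layer(4, 1, rng), sigmoid)],
    "SSdP": [(make_layer(n_features, 8, rng), sigmoid), (make_layer(8, 4, rng), sigmoid),
             (make_layer(4, 1, rng), linear)],
}
features = [0.5] * n_features  # a normalized feature vector (all values in [0, 1])
outputs = {name: forward(features, layers)[0] for name, layers in structures.items()}
```

With a sigmoid output the result always lies between 0 and 1, whereas the linear output of the SSdP structure is unbounded, which affects how the cut-off threshold is chosen.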

Results

The test results show that, on average, the efficiency indicators of the classifier are in the region of 60-70%.

At the same time, the researcher notes that the training and test samples provided for the competition contained erroneous labels. This reduces the efficiency of the proposed models, which means that an increase in classifier efficiency can be expected on “clean” data.

In addition, according to the author of the study, an increase in the size of the training database can also play a positive role.

On an independent data sample, the researcher achieved a Sensitivity of 63% and a Specificity of 71%.

The work carried out for the competition demonstrates a theoretically grounded and experimentally confirmed connection between heart rate variability and the functional changes in the body associated with smoking.

Source: https://habr.com/ru/post/392425/
