
Big Data Analysis Problems

What are the problems of analyzing Big Data?


Big Data exhibits prominent features that traditional data sets do not share. These features pose significant problems for data analysis and motivate the development of new statistical methods. Unlike traditional data sets, where the sample size is usually larger than the number of measured variables, Big Data is characterized by both a huge sample size and high dimensionality. First, we discuss how a large sample size affects our understanding of heterogeneity: on the one hand, it allows us to uncover hidden patterns associated with small subgroups of the population and weak commonalities shared across the whole population; on the other hand, modeling the internal heterogeneity of Big Data requires more sophisticated statistical methods. Second, we discuss several unique phenomena associated with high dimensionality, including noise accumulation, spurious correlation, and incidental endogeneity. These features render many traditional statistical procedures invalid.

Heterogeneity


Big Data is often created by aggregating multiple data sources corresponding to different subpopulations, and each subpopulation may exhibit unique features not shared by the others. In the classical setting, where the sample size is small or moderate, data points from small subpopulations are usually treated as outliers, and they are systematically hard to model because of the insufficient number of observations. In the era of Big Data, however, the large sample size allows us to better understand heterogeneity, shedding light on questions such as how certain covariates (for example, genes or SNPs) relate to rare outcomes (for example, rare diseases or diseases in small populations), and why certain treatments (for example, chemotherapy) benefit one subpopulation and harm another. To illustrate this point, we introduce the following mixture model for the population:

$$\lambda_1 p_1(y;\,\theta_1(x)) + \cdots + \lambda_m p_m(y;\,\theta_m(x)), \qquad (1)$$


where λj ≥ 0 is the proportion of the jth subpopulation and pj(y; θj(x)) is the probability distribution of the response of the jth subpopulation given the covariates x, with θj(x) as the parameter vector. In practice, many subpopulations are rarely observed, that is, λj is very small. When the sample size n is moderate, nλj may be small, making it infeasible to infer the covariate-dependent parameters θj(x) due to the lack of information. However, because Big Data is characterized by a large sample size n, the sample size nλj for the jth subpopulation can be moderately large even when λj is very small. This allows us to draw more accurate inferences about the subpopulation parameters θj(·). In short, a main advantage of Big Data is understanding the heterogeneity of subpopulations, such as the benefits of certain personalized treatments, which is impossible with a small or moderate sample size.

Big Data, thanks to its large sample size, also allows us to uncover weak commonalities shared across the whole population. For example, assessing the cardiovascular benefit of one glass of red wine per day is difficult without a large sample size. Similarly, the health risks associated with exposure to certain environmental factors can be assessed convincingly only when the sample size is large enough.
Besides the advantages mentioned above, the heterogeneity of Big Data also creates significant problems for statistical inference. Inferring the mixture model in (1) for large data sets requires sophisticated statistical and computational methods. In low dimensions, standard methods can be applied, such as the expectation-maximization (EM) algorithm for finite mixture models. In high dimensions, however, we need to carefully regularize the estimation procedure to avoid overfitting and noise accumulation, and to develop good computational algorithms.
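As a rough illustration, the following sketch (assuming NumPy and scikit-learn are available; the sample size, mixture proportions, and component means are invented for the example) simulates a population with a rare subgroup and fits a two-component Gaussian mixture by the EM algorithm:

```python
# Sketch: EM fitting of the finite mixture in (1) with Gaussian components.
# All numbers (n, proportions, means) are illustrative assumptions.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n = 100_000                        # "Big Data" sample size
lam = [0.98, 0.02]                 # the second subgroup is rare (lambda_2 = 0.02)
means = np.array([0.0, 3.0])

# Even though lambda_2 is tiny, n * lambda_2 = 2000 observations come from
# the rare subgroup, enough to estimate its parameters reasonably well.
labels = rng.choice(2, size=n, p=lam)
y = rng.normal(loc=means[labels], scale=1.0).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(y)
print("estimated weights:", gm.weights_)
print("estimated means:  ", gm.means_.ravel())
```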

Noise accumulation


Big Data analysis requires us to estimate and test many parameters simultaneously. Estimation errors accumulate when a decision or prediction rule depends on a large number of such parameters. This noise-accumulation effect is especially serious in high dimensions and may even dominate the true signals. It is usually handled by the sparsity assumption.

Take high-dimensional classification, for example. Poor classification performance results from the presence of many weak features that do not contribute to reducing the classification error. As an example, consider a classification problem where the data come from two classes:

$$X_1, \ldots, X_n \sim N_d(\mu_1, I_d) \quad \text{and} \quad Y_1, \ldots, Y_n \sim N_d(\mu_2, I_d). \qquad (2)$$


We want to construct a classification rule that assigns a new observation Z ∈ Rd to either the first or the second class. To illustrate the effect of noise accumulation on classification, we set n = 100 and d = 1,000. We set μ1 = 0 and take μ2 to be sparse, i.e. only the first 10 entries of μ2 are nonzero, with value 3, and all other entries are zero. Figure 1 shows the projection onto the first two principal components using the first m = 2, 40, 200 features and all 1,000 features. As these plots show, when m = 2 we obtain a high degree of discriminative power. However, the discriminative power becomes very low when m is too large, because of noise accumulation. Only the first 10 features contribute to the classification; the others do not. Therefore, when m > 10, the procedure gains no additional signal but keeps accumulating noise: the larger m is, the more noise accumulates, which deteriorates the classification procedure with increasing dimensionality. When m = 40, the accumulated signal still compensates for the accumulated noise, so the first two principal components retain good discriminative power. When m = 200, the accumulated noise exceeds the signal gain.
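A minimal sketch of this experiment, assuming NumPy and scikit-learn; the centroid-gap summary is an ad hoc stand-in for the visual separation shown in the figure:

```python
# Sketch of the noise-accumulation experiment: n = 100 per class, d = 1000,
# mu1 = 0 and mu2 sparse (first 10 entries equal to 3, the rest zero).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d = 100, 1000
mu2 = np.zeros(d)
mu2[:10] = 3.0

X = rng.standard_normal((n, d))          # class 1: N_d(0, I_d)
Y = rng.standard_normal((n, d)) + mu2    # class 2: N_d(mu2, I_d)

for m in (2, 40, 200, 1000):
    Z = np.vstack([X[:, :m], Y[:, :m]])  # keep only the first m features
    pcs = PCA(n_components=2).fit_transform(Z)
    # Distance between the two class centroids in the space of the first
    # two principal components, as a crude proxy for discriminative power.
    gap = np.linalg.norm(pcs[:n].mean(axis=0) - pcs[n:].mean(axis=0))
    print(f"m = {m:4d}: centroid gap in PC space = {gap:.2f}")
```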

The above discussion motivates the use of sparse models and variable selection to overcome the effect of noise accumulation. For example, in the classification model (2), instead of using all the features, we could select a subset of features that achieves the best signal-to-noise ratio. Such a sparse model yields higher classification accuracy. In other words, variable selection plays a pivotal role in overcoming noise accumulation in classification and regression prediction. However, variable selection in high dimensions is challenging because of spurious correlation, incidental endogeneity, heterogeneity, and measurement errors.

Spurious correlation


High dimensionality also brings spurious correlation, by which we mean that many uncorrelated random variables can have high sample correlations in high dimensions. Spurious correlation can lead to false scientific discoveries and incorrect statistical inferences.

Consider the problem of estimating the coefficient vector β of the linear model

$$y = X\beta + \epsilon, \qquad \mathrm{Var}(\epsilon) = \sigma^2 I_d, \qquad (3)$$


where y ∈ Rn is the response vector, X = [x1, …, xn]T ∈ Rn×d is the design matrix, ϵ ∈ Rn is an independent random noise vector, and Id is the d × d identity matrix. To cope with noise accumulation when the dimension d is comparable to or larger than the sample size n, we assume that only a small number of variables contribute to the response, that is, β is a sparse vector. Under this sparsity assumption, variable selection can be carried out to avoid noise accumulation, improve prediction performance, and improve the interpretability of the model with a parsimonious representation.
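As a hedged illustration of this sparsity-based variable selection (not the procedure used in the original study), one could fit a cross-validated Lasso to data simulated from model (3); the dimensions and the true coefficient vector below are invented for the example:

```python
# Sketch: sparse regression under model (3) with a cross-validated Lasso.
# n, d, and the true coefficient vector are invented for illustration.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, d = 100, 500
beta = np.zeros(d)
beta[:5] = [2.0, -1.5, 1.0, 0.8, -0.6]        # only 5 variables matter

X = rng.standard_normal((n, d))
y = X @ beta + rng.normal(scale=1.0, size=n)  # Var(eps) = sigma^2 = 1

lasso = LassoCV(cv=5).fit(X, y)
print("selected variables:", np.flatnonzero(lasso.coef_))
```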

In high dimensions, even for a model as simple as (3), variable selection is difficult because of spurious correlation. In particular, in high dimensions important variables can be strongly correlated with several spurious variables that are scientifically unrelated to them. Consider a simple example illustrating this phenomenon. Let x1, …, xn be independent observations of the d-dimensional Gaussian random vector X = (X1, …, Xd)T ∼ Nd(0, Id). We repeatedly simulate data with n = 60 and d = 800 and 6,400, 1,000 times each. Figure 2a shows the empirical distribution of the maximum absolute sample correlation coefficient between the first variable and the remaining ones, defined as

$$\hat{r} = \max_{j \ge 2} \bigl|\widehat{\mathrm{Corr}}(X_1, X_j)\bigr|, \qquad (4)$$


where Ĉorr(X1, Xj) is the sample correlation between the variables X1 and Xj. We see that the maximum absolute sample correlation becomes higher as the dimension increases.
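The following sketch reproduces this simulation under the stated settings (n = 60, d = 800 and 6,400, 1,000 replications), assuming NumPy; the reported median is just one convenient summary of the empirical distribution:

```python
# Sketch of the spurious-correlation experiment behind (4): although all
# coordinates of X are independent, the maximum absolute sample correlation
# between X_1 and the remaining variables grows with the dimension d.
import numpy as np

rng = np.random.default_rng(0)
n, n_rep = 60, 1000

for d in (800, 6400):
    r_max = np.empty(n_rep)
    for k in range(n_rep):
        X = rng.standard_normal((n, d))
        Xc = X - X.mean(axis=0)               # center each column
        Xc /= np.linalg.norm(Xc, axis=0)      # and normalize it
        corr = Xc[:, 1:].T @ Xc[:, 0]         # sample correlations with X_1
        r_max[k] = np.abs(corr).max()
    print(f"d = {d}: median of r-hat over {n_rep} replications = "
          f"{np.median(r_max):.2f}")
```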

In addition, we can compute the maximum absolute multiple correlation between X1 and linear combinations of several irrelevant spurious variables:

$$\hat{R} = \max_{|S| = 4} \; \max_{\{\beta_j\}_{j=1}^{4}} \Bigl|\widehat{\mathrm{Corr}}\Bigl(X_1, \textstyle\sum_{j \in S} \beta_j X_j\Bigr)\Bigr|. \qquad (5)$$


Using the same configuration, we plot the empirical distribution of the maximum absolute sample correlation coefficient between X1 and ∑j∈S βjXj, where S is any size-four subset of {2, …, d} and βj is the least-squares regression coefficient of Xj when X1 is regressed on {Xj}j∈S. Again, we see that even though X1 is completely independent of X2, …, Xd, the correlation between X1 and the closest linear combination of any four variables from {Xj}j≠1 can be very high.
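A sketch of the computation behind (5); since an exhaustive search over all size-four subsets is combinatorially infeasible, the code below uses greedy forward selection as a cheap approximation (a lower bound on R̂), which is an assumption of this sketch rather than the procedure used for the original figure:

```python
# Sketch of the multiple-correlation experiment behind (5). An exhaustive
# search over all size-four subsets is infeasible, so greedy forward
# selection gives a cheap lower bound on R-hat (an assumption of the sketch).
import numpy as np

rng = np.random.default_rng(0)
n, d = 60, 800
X = rng.standard_normal((n, d))
X -= X.mean(axis=0)                       # center columns
x1 = X[:, 0]

selected = []
for _ in range(4):
    best_j, best_r = None, -1.0
    for j in range(1, d):
        if j in selected:
            continue
        XS = X[:, selected + [j]]
        beta, *_ = np.linalg.lstsq(XS, x1, rcond=None)
        r = abs(np.corrcoef(x1, XS @ beta)[0, 1])   # multiple correlation
        if r > best_r:
            best_j, best_r = j, r
    selected.append(best_j)

print("greedy subset:", selected, "multiple correlation ~", round(best_r, 2))
```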

Spurious correlation has a significant impact on variable selection and can lead to false scientific discoveries. Let XS = (Xj)j∈S be the random subvector indexed by S, and let Ŝ be the selected set that has the highest spurious correlation with X1, as in Fig. 2. For example, when n = 60 and d = 6,400, X1 is practically indistinguishable from XŜ for a set Ŝ with |Ŝ| = 4. If X1 represents the expression level of a gene responsible for a disease, we cannot distinguish it from the other four genes in Ŝ that have similar predictive power, even though they are scientifically irrelevant.

Besides variable selection, spurious correlation can also lead to incorrect statistical inference. We explain this by considering again the same linear model as in (3). Here we would like to estimate the standard deviation σ of the residual noise, which plays a prominent role in statistical inference for the regression coefficients, model selection, goodness-of-fit testing, and marginal regression. Let Ŝ be the set of selected variables, and let PŜ be the projection matrix onto the column space of XŜ. The standard estimate of the residual variance, based on the selected variables, is

$$\hat{\sigma}^2 = \frac{y^{T}(I_n - P_{\hat{S}})\, y}{n - |\hat{S}|}. \qquad (6)$$


Estimator (6) is unbiased when the variables are not selected using the data and the model is correct. However, the situation is completely different when the variables are selected based on the data. In particular, the authors showed that when there are many spurious variables, σ2 is seriously underestimated, which leads to erroneous statistical conclusions, including wrong model selection and significance tests, and to false scientific discoveries, such as identifying the wrong genes for molecular mechanisms. They also propose a refitted cross-validation method to alleviate the problem.
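A small simulation, under the null model y = ε with σ² = 1 and a simple marginal-correlation screening rule invented for the example, shows how estimator (6) is biased downward once the variables are chosen using the data:

```python
# Sketch: under the null model y = eps (beta = 0, sigma^2 = 1), selecting the
# s columns most correlated with y before computing (6) biases the estimate
# of sigma^2 downward. The screening rule is a simple stand-in for
# illustration, not the selection procedure studied by the authors.
import numpy as np

rng = np.random.default_rng(0)
n, d, s, n_rep = 60, 800, 5, 500
sigma2_hat = np.empty(n_rep)

for k in range(n_rep):
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)                   # pure noise, sigma^2 = 1
    score = np.abs(X.T @ (y - y.mean()))         # marginal association scores
    S = np.argsort(score)[-s:]                   # data-driven selection
    XS = X[:, S]
    P = XS @ np.linalg.pinv(XS)                  # projection onto col(X_S)
    resid = y - P @ y
    sigma2_hat[k] = resid @ resid / (n - s)      # estimator (6)

print("average sigma^2 estimate:", round(sigma2_hat.mean(), 2))  # well below 1
```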

Incidental endogeneity


Incidental endogeneity is another subtle problem arising from high dimensionality. In a regression setting Y = ∑_{j=1}^{d} βjXj + ε, the term "endogeneity" means that some predictors {Xj} are correlated with the residual noise ε. The conventional sparse model assumes

$$Y = \sum_{j} \beta_j X_j + \varepsilon, \quad \text{with } \mathrm{E}(\varepsilon X_j) = 0 \ \text{for } j = 1, \ldots, d, \qquad (7)$$


with a small set S = {j : βj ≠ 0}. The exogeneity assumption in (7), that the residual noise ε is uncorrelated with all of the predictors, is crucial for the validity of most existing statistical methods, including consistency of variable selection. Although this assumption looks innocent, it is easily violated in high dimensions, since some variables {Xj} are incidentally correlated with ε, which makes most high-dimensional procedures statistically invalid.

To explain the endogeneity problem in more detail, suppose that the response Y is related to three covariates as follows:

$$Y = X_1 + X_2 + X_3 + \varepsilon, \quad \text{with } \mathrm{E}(\varepsilon X_j) = 0 \ \text{for } j = 1, 2, 3.$$


At the data collection stage, we do not know the true model, and therefore we collect as many covariates as possible that are potentially related to Y, in the hope of including all the terms of S in (7). Incidentally, some of those Xj (for j ≠ 1, 2, 3) may be correlated with the residual noise ε. This violates the exogeneity assumption in (7). In fact, the more covariates are collected or measured, the harder this assumption is to satisfy.

Unlike spurious correlation, incidental endogeneity refers to the genuine existence of correlations between variables that are unintentionally related. The former is analogous to finding two people who look alike but have no genetic relation, whereas the latter is similar to bumping into an acquaintance, both of which happen easily in a big city. More generally, endogeneity arises from selection bias, measurement errors, and omitted variables. These phenomena often occur in the analysis of Big Data, mainly for two reasons: more features than ever are collected and measured, which increases the chance that some of them are incidentally correlated with the residual noise; and Big Data is usually aggregated from multiple sources with potentially different data-generating schemes, which increases the possibility of selection bias and measurement errors.


Does incidental endogeneity appear in real data sets, and how can we test for it in practice? We consider a genomics study in which 148 microarray samples were downloaded from the GEO and ArrayExpress databases. These samples were generated on the Affymetrix HGU133a platform for human subjects with prostate cancer. The resulting data set contains 22,283 probes, corresponding to 12,719 genes. In this example, we are interested in the gene called "discoidin domain receptor family, member 1" (abbreviated DDR1). DDR1 encodes receptor tyrosine kinases, which play an important role in the communication of cells with their microenvironment. DDR1 is known to be closely related to prostate cancer, and we want to study its relationship with other genes in cancer patients. We took the expression of the DDR1 gene as the response variable Y and the expressions of all the remaining 12,718 genes as predictors. The left panel of Fig. 3 shows the empirical distribution of the correlations between the response and individual predictors.

To illustrate the existence of endogeneity, we fit an L1-penalized least squares regression (Lasso) to the data, with the penalty parameter chosen automatically by 10-fold cross-validation (37 genes are selected). We then refit an ordinary least squares regression on the selected model to compute the residual vector. The right panel of Fig. 3 shows the empirical distribution of the correlations between the predictors and the residuals. We see that the residual noise is strongly correlated with many predictors. To make sure that these correlations are not caused purely by spurious correlation, we introduce a "null distribution" of spurious correlations by randomly permuting the row order of the design matrix, so that the predictors are truly independent of the residual noise. Comparing the two distributions, we see that the distribution of correlations between predictors and residual noise on the raw data (labeled "raw data") has a heavier tail than on the permuted data (labeled "permuted data"). This result provides strong evidence of endogeneity.
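Since the microarray data set is not reproduced here, the following sketch runs the same diagnostic on synthetic data in which a few covariates are deliberately constructed to correlate with the noise; all dimensions and the 0.5 contamination factor are assumptions of the example:

```python
# Sketch of the endogeneity diagnostic on synthetic data (the real
# GEO/ArrayExpress microarray data are not bundled here). Columns 100-109 are
# deliberately contaminated with the noise to mimic incidental endogeneity;
# the permuted design provides the null distribution of spurious correlations.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, d = 148, 2000
X = rng.standard_normal((n, d))
eps = rng.standard_normal(n)
X[:, 100:110] += 0.5 * eps[:, None]          # incidental endogeneity
y = X[:, 0] + X[:, 1] + X[:, 2] + eps        # true model uses three predictors

lasso = LassoCV(cv=10).fit(X, y)             # Lasso with 10-fold CV penalty
S = np.flatnonzero(lasso.coef_)
ols = LinearRegression().fit(X[:, S], y)     # refit OLS on the selected model
resid = y - ols.predict(X[:, S])

def corrs(M, r):
    """Sample correlations between each column of M and the vector r."""
    Mc = (M - M.mean(axis=0)) / np.linalg.norm(M - M.mean(axis=0), axis=0)
    rc = (r - r.mean()) / np.linalg.norm(r - r.mean())
    return Mc.T @ rc

raw = corrs(X, resid)
null = corrs(X[rng.permutation(n)], resid)   # rows of the design permuted
print("99th percentile |corr|, raw data:     ",
      round(float(np.quantile(np.abs(raw), 0.99)), 2))
print("99th percentile |corr|, permuted data:",
      round(float(np.quantile(np.abs(null), 0.99)), 2))
```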

Source: https://habr.com/ru/post/456088/

