Principal Component Analysis (PCA) is a classical dimension-reduction technique for multivariate data. When the data are a mixture of subjects from different subpopulations, one may be interested in the PCA of some (or each) subpopulation separately. In this paper, estimators are considered for the PC directions and corresponding eigenvalues of subpopulations in the nonparametric model of mixture with varying concentrations. Consistency and asymptotic normality of the obtained estimators are proved. These results allow one to construct confidence sets for the PC model parameters. The performance of such confidence intervals for the leading eigenvalues is investigated via simulations.

Principal components (PC) analysis is a standard dimension-reduction technique for multivariate data, introduced by K. Pearson in 1901 and reinvented by H. Hotelling in 1933 ([

Classical PCA was developed for homogeneous samples. Real-life statistical data are often a mixture of observations from different subpopulations with different distributions of the observed variables. Finite mixture models (FMM) were developed to interpret such data. For a parametric (normal) FMM, PCA provides a paradigm that allows one to describe and analyze the multivariate data distribution of each subpopulation separately in straightforward and intuitive terms. Such an approach is used, e.g., in the

In this paper we consider a modification of PCA for mixtures with varying concentrations (MVC). The MVC is a nonparametric finite mixture model in which the mixing probabilities (the concentrations of the mixture components) vary from observation to observation. Such models arise naturally in the statistical analysis of medical [

In this paper we propose estimators for the PC directions and corresponding eigenvalues for each component (subpopulation) of the mixture. The asymptotic normality of these estimators allows one to construct confidence sets for the PC parameters.

The rest of the paper is organized as follows. In Section

Here and below for any univariate sample

Let

The first PC direction

It is well known that
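As a numerical illustration of this standard connection between PC directions and the eigendecomposition of the covariance matrix, the directions and eigenvalues can be computed as follows (a minimal sketch on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic sample with three uncorrelated coordinates of different scales
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])

S = np.cov(X, rowvar=False)           # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]     # reorder: largest eigenvalue first
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

b1 = eigvecs[:, 0]   # first PC direction (unit eigenvector of the largest eigenvalue)
lam1 = eigvals[0]    # the corresponding eigenvalue: variance of the projection on b1
```

The projection of the data onto `b1` has the maximal variance among all unit directions, and that variance equals `lam1`.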

Assume that

Now consider a sample of

The weighted empirical distribution of the form
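As a computational illustration, a weighted empirical distribution function of this kind can be evaluated as below; the weights here are placeholders, and their exact construction from the mixture concentrations is as defined in the paper:

```python
import numpy as np

def weighted_empirical_cdf(xs, weights, t):
    """Weighted empirical CDF: sum_j a_j * 1{xs[j] <= t}.

    `weights` stand in for the MVC weights of one mixture component;
    their construction from the concentrations follows the paper.
    """
    xs = np.asarray(xs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * (xs <= t)))

# with uniform weights 1/n this reduces to the ordinary empirical CDF
xs = np.array([0.2, 0.7, 0.4, 0.9])
w = np.full(4, 0.25)
```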

Investigating the asymptotic behavior of the estimators as the sample size

Let

Suppose that the vectors

It is shown in [

There are other possible choices of weights in (

In this section we consider estimation of covariance matrices of the mixture components. Assume that

To estimate
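A sketch of such plug-in estimation of a component's mean and covariance is given below; the weight construction is assumed here, following the paper's weighted-empirical approach:

```python
import numpy as np

def weighted_mean_cov(X, a):
    """Plug-in estimates of one mixture component's mean vector and
    covariance matrix, using per-observation weights `a` (assumed
    nonnegative and summing to 1; their construction from the mixing
    concentrations follows the paper)."""
    X = np.asarray(X, dtype=float)
    a = np.asarray(a, dtype=float)
    m = a @ X                        # weighted mean, shape (d,)
    Xc = X - m                       # centered observations
    S = (Xc * a[:, None]).T @ Xc     # weighted covariance, shape (d, d)
    return m, S

# with uniform weights this reduces to the usual (biased) sample covariance
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
m, S = weighted_mean_cov(X, np.full(200, 1 / 200))
```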

Assume that

There exists a nonsingular limit matrix

Then

This theorem is a simple consequence of Theorem 4.2 in [

So, under suitable assumptions,

Let

Assume that the following conditions hold.

The matrix

For all

Then

Let

So

Let us calculate the limit covariance. Observe that

To apply this theorem to the construction of confidence intervals or to hypothesis testing, one needs an estimator for the asymptotic covariances

We define the principal component directions of the

To avoid ambiguity, we adopt the following rule for choosing the sign of an eigenvector. Consider
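For illustration, one simple convention that removes the sign ambiguity of an eigenvector is sketched below (the paper's exact rule may differ; any fixed convention serves the same purpose):

```python
import numpy as np

def fix_sign(v):
    """Illustrative sign convention: flip the eigenvector so that its
    largest-magnitude coordinate is positive. This is only one possible
    rule for resolving the +/- ambiguity of eigenvectors."""
    v = np.asarray(v, dtype=float)
    i = int(np.argmax(np.abs(v)))
    return v if v[i] > 0 else -v
```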

Natural estimators for

Let

Let

Assume the following.

The matrix

For all

All the eigenvalues of

Then

The weak convergence in (

Consider the

Consider a continuously differentiable parametric family

Proceeding in the same way with (

Theorem

By Theorem

So, if

To evaluate the finite-sample behavior of the proposed technique, we performed a small simulation study. Confidence intervals for the largest eigenvalue

For each mixture component we present the coverage frequency of the intervals, i.e., the number of confidence intervals that covered the true
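Computationally, a coverage frequency is just the share of simulated intervals that contain the true value; a minimal sketch with made-up interval endpoints:

```python
import numpy as np

def coverage_frequency(lowers, uppers, true_value):
    """Share of simulated intervals [lowers[i], uppers[i]] that contain
    the true parameter value."""
    lowers = np.asarray(lowers, dtype=float)
    uppers = np.asarray(uppers, dtype=float)
    hits = (lowers <= true_value) & (true_value <= uppers)
    return float(hits.mean())
```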

In all the experiments the concentrations

The observations

In the

Coverage probabilities of the first experiment

Sample size | first component | second component | third component
250 | 0.973 | 0.872 | 0.956
500 | 0.969 | 0.925 | 0.952
1000 | 0.962 | 0.968 | 0.953
2500 | 0.952 | 0.966 | 0.951
5000 | 0.948 | 0.941 | 0.956
10000 | 0.955 | 0.960 | 0.953

In the

Coverage probabilities of the second experiment

Sample size | first component | second component | third component
250 | 0.971 | 0.839 | 0.959
500 | 0.961 | 0.879 | 0.964
1000 | 0.964 | 0.919 | 0.955
2500 | 0.954 | 0.925 | 0.945
5000 | 0.943 | 0.932 | 0.945
10000 | 0.950 | 0.921 | 0.947

We proposed a technique for estimating PC directions and eigenvalues from observations drawn from an MVC. Asymptotic normality of the estimators is proved. This opens possibilities for constructing confidence sets and testing hypotheses about the PC structure of different mixture components. The simulation results confirm the applicability of the asymptotic results for samples of moderate size.

Now let us discuss some challenges that were not addressed in this study.

1. To apply Theorem

2. It is sometimes useful in cluster analysis applications to consider FMMs with a growing number of components (clusters) as the sample size

3. There are many alternatives to PCA as a dimension reduction technique, e.g., Projection Pursuit (PP) or Independent Components Analysis [

We hope that further study will clarify the answers to these questions.

Here we will obtain conditions under which

Let vect be a function which stacks its arguments into a long vector:
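A minimal sketch of such a stacking function is given below; column-major flattening of matrix arguments is an assumption made here for illustration:

```python
import numpy as np

def vect(*arrays):
    """Stack the entries of all arguments into one long vector.
    Column-major (Fortran-order) flattening of matrices is assumed."""
    return np.concatenate(
        [np.asarray(a, dtype=float).ravel(order="F") for a in arrays]
    )
```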

Let the assumptions of Theorem

Let

Let

Then

We are grateful to the anonymous referees for their attention to our work and for their fruitful comments and suggestions.