Principal components analysis for mixtures with varying concentrations

Sugakova, Olena; Maiboroda, Rostyslav

doi:10.15559/21-VMSTA191

Abstract

Principal Component Analysis (PCA) is a classical technique of dimension reduction for multivariate data. When the data are a mixture of subjects from different subpopulations one can be interested in PCA of some (or each) subpopulation separately. In this paper estimators are considered for PC directions and corresponding eigenvectors of subpopulations in the nonparametric model of mixture with varying concentrations. Consistency and asymptotic normality of obtained estimators are proved. These results allow one to construct confidence sets for the PC model parameters. Performance of such confidence intervals for the leading eigenvalues is investigated via simulations.

1 Introduction

Principal components (PC) analysis is a standard technique of dimension reduction for multivariate data introduced by K. Pearson in 1901 and reinvented by H. Hotelling in the 1933 ([7], section 1.2). The first PC direction is the direction of the highest scattering of the data cloud and the first eigenvalue corresponding to it is the variance of the data projections on this direction. First two or three PC scores are usually used to visualize multidimensional data ([4], chapter 9).The orthogonal regression estimator for coefficients of a linear regression model is represented through the least PC direction (corresponding to the smallest eigenvalue, see (2.23) in [15]).

Classical PCA is developed for homogeneous samples. Real life statistical data is often a mixture of observations from different subpopulations with different distributions of observed variables. Finite mixture models (FMM) are developed to interpret such data. For parametric (normal) FMM the PCA provides a paradigm which allows one to describe and analyze multivariate data distribution of each subpopulation separately in straightforward and intuitive terms. Such an approach is used, e.g., in the R package mclust [14].

In this paper we consider a modification of PCA for mixtures with varying concentrations (MVC). The MVC is a nonparametric finite mixture model in which the mixing probabilities (the concentrations of the mixture components) vary from observation to observation. Such models arise naturally in statistical analysis of medical [9] and sociological [12] data. A technique of neuronal activity analysis based on the MVC approach is considered in [13]. See also [2] for adaptive estimation and [1] for adaptive hypotheses testing in MVC models.

In this paper we propose estimators for PC directions and corresponding eigenvectors for each component (subpopulation) of the mixture. Asymptotic normality of these estimators allows one to construct confidence sets for the PC parameters.

The rest of the paper is organized as follows. In Section 2 we give a brief exposition of the classical PC analysis. Section 3 contains general description of the MVC model. In Section 4 we present an estimator for the covariance matrices of the mixture components and derive its asymptotic normality. Section 5 is devoted to the estimators of PC directions and eigenvalues and their asymptotic normality. In Section 6 we apply these results to construction of confidence intervals for the eigenvalues. Section 7 contains results of simulations. The results and further development are discussed in Section 8. One technical result is placed in the Appendix.

2 Classical principal component analysis

Here and below for any univariate sample $\mathbf{x}=({x_{1}},\dots ,{x_{n}})$,

\[ \bar{\mathbf{x}}=\frac{1}{n}{\sum \limits_{j=1}^{n}}{x_{j}}\]

denotes the sample mean,

\[ {S^{2}}(\mathbf{x})=\frac{1}{n}{\sum \limits_{j=1}^{n}}{({x_{j}}-\bar{\mathbf{x}})^{2}}\]

is the sample variance of x, and $|\mathbf{v}|$ denotes the Euclidean norm of v.

Let $\mathbb{X}\hspace{-0.1667em}=\hspace{-0.1667em}({\mathbf{X}_{1}},\dots ,{\mathbf{X}_{n}})$ be a sample of d-dimensional vectors ${\mathbf{X}_{j}}\hspace{-0.1667em}=\hspace{-0.1667em}{({X_{j}^{1}},\dots ,{X_{j}^{d}})^{T}}$, $j=1,\dots ,n$,

\[ {\mathbf{u}^{T}}\mathbb{X}=({\mathbf{u}^{T}}{\mathbf{X}_{1}},\dots ,{\mathbf{u}^{T}}{\mathbf{X}_{n}}).\]

The first PC direction ${\mathbf{v}_{1}}=\mathbf{v}(1;\mathbb{X})$ of the sample $\mathbb{X}$ is the vector in ${\mathbb{R}^{d}}$ of unit length such that

\[ {S^{2}}({\mathbf{v}_{1}^{T}}\mathbb{X})=\max \{{S^{2}}({\mathbf{u}^{T}}\mathbb{X})\hspace{2.5pt}:\mathbf{u}\in {\mathbb{R}^{d}},|\mathbf{u}|=1\}.\]

Then, for $l=2,\dots ,d$ the l-th PC direction is the vector ${\mathbf{v}_{l}}=\mathbf{v}(l;\mathbb{X})$ of unit length such that

\[ {S^{2}}({\mathbf{v}_{l}^{T}}\mathbb{X})\hspace{-0.1667em}=\hspace{-0.1667em}\max \{{S^{2}}({\mathbf{u}^{T}}\mathbb{X}):\mathbf{u}\in {\mathbb{R}^{d}},|\mathbf{u}|=1,{\mathbf{u}^{T}}\mathbf{v}(1;\mathbb{X})\hspace{-0.1667em}=\hspace{-0.1667em}0,\dots ,{\mathbf{u}^{T}}\mathbf{v}(l-1;\mathbb{X})=0\}.\]

So, the first PC direction is the direction of maximal scattering of a data cloud, the second one is the direction orthogonal to the first one in which the scattering is maximal, and so on.

It is well known that $\mathbf{v}(1,\mathbb{X}),\mathbf{v}(2,\mathbb{X}),\dots ,\mathbf{v}(d,\mathbb{X})$ are the eigenvectors of the sample covariance matrix ${\hat{\mathbf{C}}_{n}}=\operatorname{Cov}(\mathbb{X})$ corresponding to its eigenvalues

\[ \lambda (1;\mathbb{X})\ge \lambda (2;\mathbb{X})\ge \cdots \ge \lambda (d;\mathbb{X}),\]

i.e.

\[ {\hat{\mathbf{C}}_{n}}\mathbf{v}(l;\mathbb{X})=\lambda (l;\mathbb{X})\mathbf{v}(l;\mathbb{X}).\]

Note that

\[ \lambda (l;\mathbb{X})={S^{2}}(\mathbf{v}{(l;\mathbb{X})^{T}}\mathbb{X}).\]

If all the eigenvalues are different, then the PC directions are defined unambiguously (up to multiplication by $\pm 1$).

Assume that ${\mathbf{X}_{j}}$ are i.i.d. random vectors with a distribution F, i.e. $\operatorname{\mathsf{P}}\{{\mathbf{X}_{j}}\in A\}=F(A)$ for all Borel sets $A\subseteq {\mathbb{R}^{d}}$. Then the PC directions and corresponding eigenvalues of the sample can be interpreted as estimators of the true theoretical PC directions $\mathbf{v}(l;F)$ and eigenvalues $\lambda (l;F)$ which are the eigenvalues and eigenvectors of the covariance of a random vector X with the distribution F:

\[\begin{array}{l}\displaystyle \mathbf{C}=\mathbf{C}(F)=\operatorname{\mathsf{E}}(\mathbf{X}-\operatorname{\mathsf{E}}\mathbf{X}){(\mathbf{X}-\operatorname{\mathsf{E}}\mathbf{X})^{T}},\\ {} \displaystyle \mathbf{C}\mathbf{v}(l;F)=\lambda (l;F)\mathbf{v}(l;F),\hspace{2.5pt}l=1,\dots ,d.\end{array}\]

These theoretical PC directions possess optimal qualities similar to the sample PC. E.g., the projection of X on the first PC direction is of maximal variance:

\[ \operatorname{Var}(\mathbf{v}{(1,F)^{T}}\mathbf{X})=\max \{\operatorname{Var}({\mathbf{u}^{T}}\mathbf{X})\hspace{2.5pt}:\hspace{2.5pt}\mathbf{u}\in {\mathbb{R}^{d}},|\mathbf{u}|=1\}.\]

3 Mixtures with varying concentrations

Now consider a sample of n subjects taken from M different subpopulations (mixture components). For the j-th subject the vector of d observed variables is denoted by ${\mathbf{X}_{j}}={({X_{j}^{1}},\dots ,{X_{j}^{d}})^{T}}$. The true number of the component, to which the j-th subject belongs, is denoted by ${\kappa _{j}}\in \{1,\dots ,M\}$. This number is not observed, but one knows the probabilities

\[ {p_{j}^{m}}={p_{j;n}^{m}}=\operatorname{\mathsf{P}}\{{\kappa _{j}}=m\},\hspace{2.5pt}m=1,\dots ,M.\]

Distribution of ${\mathbf{X}_{j}}$ depends on ${\kappa _{j}}$:

\[ {F_{m}}(A)=\operatorname{\mathsf{P}}\{{X_{j}}\in A\hspace{2.5pt}|\hspace{2.5pt}{\kappa _{j}}=m\}.\]

So the unconditional distribution of the observed ${\mathbf{X}_{j}}$ is a mixture of M components’ distributions

(1)

\[ \operatorname{\mathsf{P}}\{{\mathbf{X}_{j}}\in A\}={\sum \limits_{m=1}^{M}}{p_{j}^{m}}{F_{m}}(A).\]

We assume that $({\mathbf{X}_{j}},{\kappa _{j}})$ are independent for different $j=1,\dots ,n$. The formula (1) is called the model of mixture with varying concentrations (MVC model). Note that the components’ distributions ${F_{m}}$ are completely unknown and the concentrations ${p_{j}^{m}}$ are known in this model.

The weighted empirical distribution of the form

(2)

\[ {\hat{F}_{m;n}}(A)={\sum \limits_{j=1}^{n}}{w_{j}^{m}}\mathbb{1}\{{\mathbf{X}_{j}}\in A\}\]

can be used to estimate ${F_{m}}(A)$. Here ${w_{j}^{m}}$ are some weights constructed from the concentrations ${p_{j}^{k}}$, $j=1,\dots ,n$, $k=1,\dots ,M$. These weights are aimed to pick out the m-th component and to suppress the influence of all other components on the estimator.

Investigating the asymptotic behavior of the estimators as the sample size n tends to infinity we will consider different ${p_{j}^{m}}$ and ${w_{j}^{m}}$ for different n. Sometimes it will be denoted by the subscript ${_{;n}}$: ${p_{j;n}^{m}}$, ${w_{j;n}^{m}}$, ${\mathbf{X}_{j;n}}$. If this subscript is dropped it means that we consider here a sample of fixed size n.

Let

\[ {\mathbf{p}_{;n}^{m}}={({p_{1;n}^{m}},\dots ,{p_{n;n}^{m}})^{T}},{\mathbf{p}_{;n}}=({\mathbf{p}_{;n}^{1}},\dots {\mathbf{p}_{;n}^{M}}),\]

i.e. ${\mathbf{p}_{;n}}$ denotes a matrix of concentrations with M columns and n rows. Each column corresponds to a mixture component, each row corresponds to an observation. Similar notations ${\mathbf{w}_{;n}}$, ${\mathbf{w}_{;n}^{m}}$ will be used for the weights.

Suppose that the vectors ${\mathbf{p}_{;n}^{m}}$, $m=1,\dots ,M$, are linearly independent. Then the matrix ${\boldsymbol{\Gamma }_{;n}}=({\mathbf{p}_{;n}^{T}}{\mathbf{p}_{;n}})$ is nonsingular. We will use the weights

(3)

\[ {\mathbf{w}_{;n}}={\boldsymbol{\Gamma }_{;n}^{-1}}{\mathbf{p}_{;n}}.\]

It is shown in [9] that ${\hat{F}_{m;n}}$ defined by (2) with the weights ${\mathbf{w}_{;n}^{m}}$ defined by (3) is a minimax estimator for ${F_{m}}$ with respect to the quadratic loss. So the weights (3) are called the minimax weights.

There can be some other choices of weights in (2). E.g., in [10] an adaptive approach is proposed which allows to obtain asymptotically optimal estimators of MVC model parameters. Unfortunately, the adaptive estimators need samples with quite large number observations to outperform the minimax ones in MVC models. Especially large samples are needed for multivariate data analysis which is just the case when PCA is most useful. So, in this paper we will consider estimators with the minimax weights only.

4 Estimation of covariance matrices

In this section we consider estimation of covariance matrices of the mixture components. Assume that $\operatorname{\mathsf{E}}[|{\mathbf{X}_{j}}{|^{2}}\hspace{2.5pt}|\hspace{2.5pt}{\kappa _{j}}=m]<\infty $ for all $m=1,\dots ,M$ and let

(4)

\[\begin{array}{l}\displaystyle \mu (i;m)=\operatorname{\mathsf{E}}[{X_{j}^{i}}\hspace{2.5pt}|\hspace{2.5pt}{\kappa _{j}}=m],c({i_{1}},{i_{2}};m)=\operatorname{\mathsf{E}}[{X_{j}^{{i_{1}}}}{X_{j}^{{i_{2}}}}-\mu ({i_{1}};m)\mu ({i_{2}};m)\hspace{2.5pt}|\hspace{2.5pt}{\kappa _{j}}=m],\\ {} \displaystyle {\mathbf{C}_{m}}={(c({i_{1}},{i_{2}};m))_{{i_{1}},{i_{2}}=1}^{d}}.\end{array}\]

So, ${\mathbf{C}_{m}}$ is the covariance matrix of a random vector with the distribution ${F_{m}}$. (The m-th component covariance).

To estimate $\mu (i;m)$ and $c({i_{1}},{i_{2}};m)$ one can use weighted means with weights designed for the estimation of ${F_{m}}$. Say,

(5)

\[ {\hat{\mu }_{;n}}(i;m)={\sum \limits_{j=1}^{n}}{w_{j;n}^{m}}{X_{j}^{i}},\]

(6)

\[\begin{array}{l}\displaystyle {\hat{c}_{;n}}({i_{1}},{i_{2}};m)={\sum \limits_{j=1}^{n}}{w_{j;n}^{m}}{X_{j}^{{i_{1}}}}{X_{j}^{{i_{2}}}}-{\hat{\mu }_{;n}}({i_{1}};m){\hat{\mu }_{;n}}({i_{2}};m),\\ {} \displaystyle {\hat{\mathbf{C}}_{m;n}}={(\hat{c}({i_{1}},{i_{2}};m))_{{i_{1}},{i_{2}}=1}^{d}}.\end{array}\]

Theorem 1.

Assume that

(i) $\operatorname{\mathsf{E}}[|{\mathbf{X}_{j}}{|^{2}}\hspace{2.5pt}|\hspace{2.5pt}{\kappa _{j}}=m]<\infty $ for all $m=1,\dots ,M$;
(ii) There exists nonsingular limit matrix

(7)
\[ {\boldsymbol{\Gamma }_{\infty }}=\underset{n\to \infty }{\lim }\frac{1}{n}{\boldsymbol{\Gamma }_{;n}}.\]

Then ${\hat{\mathbf{C}}_{m;n}}\to {\mathbf{C}_{m}}$ in probability.

Proof.

This theorem is a simple consequence of the theorem 4.2 in [9]. □

So, under suitable assumptions, ${\hat{\mathbf{C}}_{m;n}}$ is a consistent estimator for the m-th component covariance matrix. To establish its asymptotic normality we need some additional notations.

Let

(8)

\[ \begin{array}{r}\displaystyle {\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{{m_{1}}}}{\mathbf{p}^{{m_{2}}}}\rangle _{;n}}=n{\sum \limits_{j=1}^{n}}{w_{j;n}^{{k_{1}}}}{w_{j;n}^{{k_{2}}}}{p_{j;n}^{{m_{1}}}}{p_{j;n}^{{m_{2}}}},\\ {} \displaystyle \langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{{m_{1}}}}{\mathbf{p}^{{m_{2}}}}\rangle =\underset{n\to \infty }{\lim }{\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{{m_{1}}}}{\mathbf{p}^{{m_{2}}}}\rangle _{;n}},\end{array}\]

(9)

\[ {\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{m}}\rangle _{;n}}=n{\sum \limits_{j=1}^{n}}{w_{j;n}^{{k_{1}}}}{w_{j;n}^{{k_{2}}}}{p_{j;n}^{m}},\hspace{2.5pt}\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{m}}\rangle =\underset{n\to \infty }{\lim }{\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{m}}\rangle _{;n}},\]

assuming that these limits exist. Then

(10)

\[\begin{array}{l}\displaystyle {\eta _{j}}({i_{1}},{i_{2}};k)={X_{j}^{{i_{1}}}}{X_{j}^{{i_{2}}}}-{X_{j}^{{i_{1}}}}\mu ({i_{2}};k)-{X_{j}^{{i_{2}}}}\mu ({i_{1}};k),\\ {} \displaystyle {M_{1}}({i_{1}},{i_{2}};k;m)=\operatorname{\mathsf{E}}[{\eta _{j}}({i_{1}},{i_{2}};k)\hspace{2.5pt}|\hspace{2.5pt}{\kappa _{j}}=m],\\ {} \displaystyle {M_{2}}({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}};m)=\operatorname{\mathsf{E}}[{\eta _{j}}({i_{1}},{i_{2}};{k_{1}}){\eta _{j}}({i_{3}},{i_{4}};{k_{2}})\hspace{2.5pt}|\hspace{2.5pt}{\kappa _{j}}=m],\\ {} \displaystyle \begin{aligned}{}V& ({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}})={\sum \limits_{m=1}^{M}}\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{m}}\rangle {M_{2}}({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}};m)\\ {} & -{\sum \limits_{{m_{1}},{m_{2}}=1}^{M}}\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{{m_{1}}}}{\mathbf{p}^{{m_{2}}}}\rangle {M_{1}}({i_{1}},{i_{2}},;{k_{1}};{m_{1}}){M_{1}}({i_{3}},{i_{4}},;{k_{2}};{m_{2}}).\end{aligned}\end{array}\]

Theorem 2.

Assume that the following conditions hold.

(i) $\operatorname{\mathsf{E}}[|{\mathbf{X}_{j}}{|^{4}}\hspace{2.5pt}|\hspace{2.5pt}{\kappa _{j}}=m]<\infty $ for all $m=1,\dots ,M$.
(ii) The matrix ${\boldsymbol{\Gamma }_{\infty }}$ defined by (7) exists and is nonsingular.
(iii) For all ${m_{1}},{m_{2}},{k_{1}},{k_{2}}=1,\dots ,M$ there exist $\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{{m_{1}}}}{\mathbf{p}^{{m_{2}}}}\rangle $ defined by (8).

Then

\[ \sqrt{n}({\hat{\mathbf{C}}_{m;n}}-{\mathbf{C}_{m}})\stackrel{\text{W}}{\longrightarrow }{\mathbf{Z}_{m}},\hspace{2.5pt}\text{for}\hspace{2.5pt}m=1,\dots ,M,\]

where ${\mathbf{Z}_{m}}={(z({i_{1}},{i_{2}};m))_{{i_{1}},{i_{2}}=1}^{d}}$, $m=1,\dots ,M$ is a set of matrices with zero mean entries and the covariance structure

\[ \operatorname{\mathsf{E}}z({i_{1}},{i_{2}};{k_{1}})z({i_{3}},{i_{4}};{k_{2}})=V({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}}),\]

for ${j_{1}},{j_{2}},{j_{3}},{j_{4}}=1,\dots ,d$, ${k_{1}},{k_{2}}=1,\dots ,M$.

Proof.

Let

(11)

\[\begin{array}{l}\displaystyle {z_{;n}}({i_{1}},{i_{2}};k)=\sqrt{n}({\hat{c}_{;n}}({i_{1}},{i_{2}};k)-c({i_{1}},{i_{2}};k)),\\ {} \displaystyle {\tilde{z}_{;n}}({i_{1}},{i_{2}};k)=\sqrt{n}{\sum \limits_{j=1}^{n}}{w_{j;n}^{k}}[{\eta _{j}}({i_{1}},{i_{2}};k)-\operatorname{\mathsf{E}}{\eta _{j}}({i_{1}},{i_{2}};k)].\end{array}\]

By somewhat tedious but straightforward algebra one obtains

\[\begin{aligned}{}{z_{;n}}({i_{1}},{i_{2}};k)& -{\tilde{z}_{;n}}({i_{1}},{i_{2}};k)\\ {} =& \sqrt{n}\bigg({\sum \limits_{j=1}^{n}}{w_{j;n}^{k}}{X_{j}^{{i_{1}}}}-\mu ({i_{1}};k)\bigg)\bigg({\sum \limits_{j=1}^{n}}{w_{j;n}^{k}}{X_{j}^{{i_{2}}}}-\mu ({i_{2}};k)\bigg).\end{aligned}\]

Observe that

\[ \operatorname{\mathsf{E}}\bigg[{\sum \limits_{j=1}^{n}}{w_{j;n}^{k}}{X_{j}^{i}}\bigg]={\sum \limits_{j=1}^{n}}{\sum \limits_{m=1}^{M}}{p_{j:n}^{m}}{w_{j;n}^{k}}\mu (i;m)=\mu (i;k),\]

since ${\textstyle\sum _{j=1}^{n}}{p_{j:n}^{m}}{w_{j;n}^{k}}$ equals 1 if $m=k$, and 0 otherwise.

\[ \operatorname{\mathsf{E}}{\bigg[{\sum \limits_{j=1}^{n}}{w_{j;n}^{k}}{X_{j}^{i}}-\mu (i;k)\bigg]^{2}}={\sum \limits_{j=1}^{n}}{({w_{j;n}^{k}})^{2}}\operatorname{Var}{X_{j}^{i}}.\]

Due to the assumption (ii), ${\sup _{j,n}}\operatorname{Var}{X_{j}^{i}}<\infty $. By lemma 1 in [11], ${\sup _{j,n}}|{w_{j;n}^{k}}|=O({n^{-1}})$, so, by the Chebyshev inequality,

\[ {\sum \limits_{j=1}^{n}}{w_{j;n}^{k}}{X_{j}^{i}}-\mu (i;k)={O_{P}}({n^{-1/2}})\]

and

\[ {z_{;n}}({i_{1}},{i_{2}};k)-{\tilde{z}_{;n}}({i_{1}},{i_{2}};k)={o_{P}}(1).\]

So it is enough to prove the statement of the Theorem for ${\tilde{z}_{;n}}({i_{1}},{i_{2}};k)$ instead of ${z_{;n}}({i_{1}},{i_{2}};k)$. Asymptotic normality of the set $({\tilde{z}_{;n}}({i_{1}},{i_{2}};k),{i_{1}},{i_{2}}=1,\dots ,d,k=1,\dots ,M)$ can be proved applying the Central Limit Theorem with the Lindeberg’s condition by the same way as in Theorem 4.3. in [9].

Let us calculate the limit covariance. Observe that

\[ \operatorname{\mathsf{E}}{\eta _{j}}({i_{1}},{i_{2}};k)={\sum \limits_{m=1}^{M}}{p_{j;n}^{m}}{M_{1}}({i_{1}},{i_{2}};k;m)\]

and

\[ \operatorname{\mathsf{E}}{\eta _{j}}({i_{1}},{i_{2}};{k_{1}}){\eta _{j}}({i_{3}},{i_{4}};{k_{2}})={\sum \limits_{m=1}^{M}}{p_{j;n}^{m}}{M_{2}}({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}};m).\]

\[\begin{aligned}{}\operatorname{\mathsf{E}}{\tilde{z}_{;n}}& ({i_{1}},{i_{2}};{k_{1}}){\tilde{z}_{;n}}({i_{3}},{i_{4}};{k_{2}})=\\ {} & =n{\sum \limits_{j=1}^{n}}{w_{j;n}^{{k_{1}}}}{w_{j;n}^{{k_{2}}}}\operatorname{\mathsf{E}}{\eta _{j}}({i_{1}},{i_{2}};{k_{1}}){\eta _{j}}({i_{3}},{i_{4}};{k_{2}})\\ {} & -n{\sum \limits_{j=1}^{n}}{w_{j;n}^{{k_{1}}}}{w_{j;n}^{{k_{2}}}}\operatorname{\mathsf{E}}{\eta _{j}}({i_{1}},{i_{2}};{k_{1}})\operatorname{\mathsf{E}}{\eta _{j}}({i_{3}},{i_{4}};{k_{2}})\\ {} & ={\sum \limits_{m=1}^{M}}{\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{m}}\rangle _{;n}}{M_{2}}({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}};m)\\ {} & -{\sum \limits_{{m_{1}},{m_{2}}=1}^{M}}{\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{{m_{1}}}}{\mathbf{p}^{{m_{2}}}}\rangle _{;n}}{M_{1}}({i_{1}},{i_{2}},;{k_{1}};{m_{1}}){M_{1}}({i_{3}},{i_{4}},;{k_{2}};{m_{2}}).\end{aligned}\]

Note that, since ${\textstyle\sum _{m=1}^{M}}{p_{j;n}^{m}}=1$, assumption (iii) implies that

\[ {\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{m}}\rangle _{;n}}={\sum \limits_{k=1}^{M}}{\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{m}}{\mathbf{p}^{k}}\rangle _{;n}}\to \langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{m}}\rangle ={\sum \limits_{k=1}^{M}}\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{m}}{\mathbf{p}^{k}}\rangle \]

as $n\to \infty $. So

\[ \operatorname{\mathsf{E}}{\tilde{z}_{;n}}({i_{1}},{i_{2}};{k_{1}}){\tilde{z}_{;n}}({i_{3}},{i_{4}};{k_{2}})\to V({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}}).\]

The Theorem is proved. □

To apply this theorem for the construction of confidence interval or hypotheses testing one needs an estimator for the asymptotic covariances $V({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}})$. To obtain it, let us consider

\[ {\hat{\eta }_{j;n}}({i_{1}},{i_{2}};k)={X_{j}^{{i_{1}}}}{X_{j}^{{i_{2}}}}-{X_{j}^{{i_{1}}}}{\hat{\mu }_{;n}}({i_{2}};k)-{X_{j}^{{i_{2}}}}{\hat{\mu }_{;n}}({i_{1}};k).\]

Observe that, under the assumptions of Theorem 2,

\[ {\widehat{M}_{1;n}}({i_{1}},{i_{2}};k;m)={\sum \limits_{j=1}^{n}}{w_{j;n}^{m}}{\hat{\eta }_{j;n}}({i_{1}},{i_{2}};k)\]

and

\[ {\widehat{M}_{2;n}}({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}};m)={\sum \limits_{j=1}^{n}}{w_{j;n}^{m}}{\hat{\eta }_{j;n}}({i_{1}},{i_{2}};{k_{1}}){\hat{\eta }_{j;n}}({i_{3}},{i_{4}};{k_{2}})\]

are consistent estimators to ${M_{1}}({i_{1}},{i_{2}};k;m)$ and ${M_{2}}({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}};m)$ respectively. So

(12)

\[ \begin{aligned}{}{\widehat{V}_{;n}}& ({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}})={\sum \limits_{m=1}^{M}}{\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{m}}\rangle _{;n}}{\widehat{M}_{2;n}}({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}};m)\\ {} & -{\sum \limits_{{m_{1}},{m_{2}}=1}^{M}}{\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{{m_{1}}}}{\mathbf{p}^{{m_{2}}}}\rangle _{;n}}{\widehat{M}_{1;n}}({i_{1}},{i_{2}},;{k_{1}};{m_{1}}){\widehat{M}_{1;n}}({i_{3}},{i_{4}},;{k_{2}};{m_{2}}).\end{aligned}\]

is a consistent estimator to $V({i_{1}},{i_{2}},{i_{3}},{i_{4}};{k_{1}},{k_{2}})$.

5 Principal components for mixtures

We define the principal components directions of the k-th mixture component as the eigenvectors of ${\mathbf{C}_{k}}$. Let $\lambda (1;k)>\lambda (2;k)>\cdots \lambda (d;k)$ be the eigenvalues of ${\mathbf{C}_{k}}$ and $\mathbf{v}(l;k)={({v^{1}}(l;k),\dots ,{v^{d}}(l;k))^{T}}$ be the corresponding eigenvectors:

(13)

\[ {\mathbf{C}_{k}}\mathbf{v}(l;k)=\lambda (l;k)\mathbf{v}(l;k).\]

In what follows we assume that all the eigenvalues of ${\mathbf{C}_{k}}$ are simple (i.e. there are d different eigenvalues) and $|\mathbf{v}(l;k)|=1$. Then these vectors are defined unambiguously up to the sign multiplier $\pm 1$: if (13) holds for $\mathbf{v}(l;k)$ then $-\mathbf{v}(l;k)$ also satisfy it.

To avoid the ambiguity, we adopt the following rule for choosing the sign of an eigenvector. Consider $v={\max _{i=1,\dots ,d}}|{v^{i}}(l;k)|$ and ${i_{0}}=\min \{i\hspace{2.5pt}:|{v^{i}}(l;k)|=v\}$. We choose as the PC direction the version of $\mathbf{v}(l;k)$ for which ${v^{{i_{0}}}}(l;k)>0$.

Natural estimators for $\lambda (l;k)$ and $\mathbf{v}(l;k)$ are the eigenvalues and eigenvectors of ${\hat{\mathbf{C}}_{k;n}}$. Let ${\hat{\lambda }_{;n}}(l;k)$ denote the l-th (in the decreasing order) eigenvalue of ${\hat{\mathbf{C}}_{k;n}}$. To choose the sign of the corresponding estimated eigenvector ${\hat{\mathbf{v}}_{;n}}(l;k)=({\hat{v}_{;n}^{1}}(l;k),\dots ,{\hat{v}_{;n}^{d}}{(l;k))^{T}}$ we need somewhat more complicated algorithm than in the case of $\mathbf{v}(l;k)$.

Let ${\varepsilon _{n}}>0$ be some sequence such that ${\varepsilon _{n}}\to 0$ as $n\to \infty $. Consider ${\hat{v}_{;n}}={\max _{i=1,\dots ,d}}|{\hat{v}_{;n}^{i}}(l;k)|$ and ${\hat{i}_{0}}=\min \{i\hspace{2.5pt}:|{\hat{v}^{i}}(l;k)|\ge {\hat{v}_{;n}}-{\varepsilon _{n}}\}$. Then we choose the sign of ${\hat{\mathbf{v}}_{;n}}(l;k)$ so that ${\hat{v}_{;n}^{{\hat{i}_{0}}}}>0$.

Let $\mathbb{E}$ be the unit $d\times d$-matrix and ${\mathbf{A}^{+}}$ denotes the Moore–Penrose inverse of a matrix A.

Theorem 3.

Assume the following.

(i) $\operatorname{\mathsf{E}}[|{\mathbf{X}_{j}}{|^{4}}\hspace{2.5pt}|\hspace{2.5pt}{\kappa _{j}}=m]<\infty $ for all $m=1,\dots ,M$.
(ii) The matrix ${\boldsymbol{\Gamma }_{\infty }}$ defined by (7) exists and is nonsingular.
(iii) For all ${m_{1}},{m_{2}},{k_{1}},{k_{2}}=1,\dots ,M$ there exist $\langle {\mathbf{w}^{{k_{1}}}}{\mathbf{w}^{{k_{2}}}}{\mathbf{p}^{{m_{1}}}}{\mathbf{p}^{{m_{2}}}}\rangle $ defined by (8).
(iv) All the eigenvalues of ${\mathbf{C}_{k}}$ are simple.
(v) ${\varepsilon _{n}}\to 0$, $\sqrt{n}{\varepsilon _{n}}\to \infty $ as $n\to \infty $.

Then

(14)

\[ \sqrt{n}({\hat{\lambda }_{;n}}(l;k))-\lambda (l;k))\stackrel{\text{W}}{\longrightarrow }{\mathbf{v}^{T}}(l;k){\mathbf{Z}_{k}}\mathbf{v}(l;k),\]

(15)

\[ \sqrt{n}({\hat{\mathbf{v}}_{;n}}(l;k))-\mathbf{v}(l;k))\stackrel{\text{W}}{\longrightarrow }{({\mathbf{C}_{k}}-\lambda (l;k)\mathbb{E})^{+}}{\mathbf{Z}_{k}}\mathbf{v}(l;k),\]

as $n\to \infty $, for all $l=1,\dots ,d$, where ${\mathbf{Z}_{k}}$ is defined in Theorem 2.

Remark.

The weak convergence in (14)–(15) is simultaneous. I.e. the common distribution of the set of LHS for $l=1,\dots ,d$ in (14)–(15) converges to the common distribution of RHS. Moreover, if assumption (iv) holds for some set of k, then the convergence is simultaneous for these k.

Proof.

Consider the l-th eigenvalue ${\lambda _{l}}(\mathbf{C})$ of a symmetric matrix $\mathbf{C}\in {\mathbb{R}^{d\times d}}$ as a function of C. Since all the eigenvalues of ${\mathbf{C}_{k}}$ are simple, there exists a neighborhood ${N_{k}}$ of ${\mathbf{C}_{k}}$ at which ${\lambda _{l}}(\mathbf{C})$ is continuous. It is well known that this is a continuously differentiable function of C. There exist also two versions $\pm {\mathbf{v}_{l}}(\mathbf{C})$ of the l-th eigenvector of C each of which is continuously differentiable in ${N_{k}}$. We choose the version ${\mathbf{v}_{l}}(\mathbf{C})$ for which ${\mathbf{v}_{l}}({\mathbf{C}_{k}})=\mathbf{v}(l,k)$.

Consider a continuously differentiable parametric family ${\mathbf{C}_{t}}$, $t\in (a,b)\subset \mathbb{R}$, of $d\times d$ symmetric matrices with simple eigenvalues. Differentiating the equations

\[ {\mathbf{C}_{t}}{\mathbf{v}_{l}}({\mathbf{C}_{t}})={\lambda _{l}}({\mathbf{C}_{t}}){\mathbf{v}_{l}}({\mathbf{C}_{t}})\]

and

\[ {({\mathbf{v}_{l}}({\mathbf{C}_{t}}))^{T}}{\mathbf{v}_{l}}({\mathbf{C}_{t}})=1,\]

one obtains

(16)

\[ \frac{d}{dt}{\lambda _{l}}({\mathbf{C}_{t}})={({\mathbf{v}_{l}}({\mathbf{C}_{t}}))^{T}}\left(\frac{d}{dt}{\mathbf{C}_{t}}\right){\mathbf{v}_{l}}({\mathbf{C}_{t}}),\]

(17)

\[ \frac{d}{dt}{\mathbf{v}_{l}}({\mathbf{C}_{t}})={({\mathbf{C}_{t}}-{\lambda _{l}}({\mathbf{C}_{t}})\mathbb{E})^{+}}\left(\frac{d}{dt}{\mathbf{C}_{t}}\right){\mathbf{v}_{l}}({\mathbf{C}_{t}})\]

(Theorem 8.9 in [8]). By the Taylor expansion

\[\begin{array}{l}\displaystyle {\hat{\lambda }_{;n}}(l;k))-\lambda (l;k)={\lambda _{l}}({\hat{\mathbf{C}}_{k;n}})-{\lambda _{l}}({\mathbf{C}_{k}})\\ {} \displaystyle ={\sum \limits_{i,j=1}^{d}}\left(\frac{\partial }{\partial c(i,j;k)}{\lambda _{l}}({\mathbf{C}_{k}})\right)({\hat{c}_{;n}}(i,j;k)-c(i,j;k))+o(|{\hat{\mathbf{C}}_{k;n}}-{\mathbf{C}_{k}}|).\end{array}\]

Applying (16) with $t=c(i,j;k)$ we obtain

\[ \sqrt{n}({\hat{\lambda }_{;n}}(l;k))-\lambda (l;k))={(\mathbf{v}(l;k))^{T}}\sqrt{n}({\hat{\mathbf{C}}_{k;n}}-{\mathbf{C}_{k}})\mathbf{v}(l;k)+o(|{\hat{\mathbf{C}}_{k;n}}-{\mathbf{C}_{k}}|).\]

By Theorem 2 this implies (14).

Proceeding by the same way with (17) in view we obtain

\[ \sqrt{n}({\mathbf{v}_{l}}({\hat{\mathbf{C}}_{k;n}})-{\mathbf{v}_{l}}({\mathbf{C}_{k}}))\stackrel{\text{W}}{\longrightarrow }{({\mathbf{C}_{k}}-\lambda (l;k)\mathbb{E})^{+}}{\mathbf{Z}_{k}}\mathbf{v}(l;k).\]

It is rather similar to (15), but recall that ${\hat{\mathbf{v}}_{;n}}(l;k))=\pm {\mathbf{v}_{l}}({\hat{\mathbf{C}}_{k;n}})$. Observe that ${\mathbf{v}_{l}}({\hat{\mathbf{C}}_{k;n}})-{\mathbf{v}_{l}}({\mathbf{C}_{k}})={O_{P}}(1/\sqrt{n})$, while $1/\sqrt{n}=o({\varepsilon _{n}})$. So our sign choosing rule for ${\hat{\mathbf{v}}_{;n}}(l;k)$ will choose the right sign with probability tending to 1 as $n\to \infty $:

\[ \operatorname{\mathsf{P}}\{{\hat{\mathbf{v}}_{;n}}(l;k)={\mathbf{v}_{l}}({\hat{\mathbf{C}}_{k;n}})\}\to 1\hspace{2.5pt}\text{as}\hspace{2.5pt}n\to \infty .\]

Theorem is proved. □

6 Confidence intervals for eigenvalues

Theorem 3 can be applied to testing hypotheses on PC directions of different mixture components. It also allows one to construct confidence sets for PC directions and eigenvalues. As an example we consider construction of a confidence interval for one eigenvalue $\lambda (l;k)$ of the k-th mixture component covariance matrix ${\mathbf{C}_{k}}$.

By Theorem 3

\[ \sqrt{n}({\hat{\lambda }_{;n}}(l;k)-\lambda (l;k))\stackrel{\text{W}}{\longrightarrow }N(0,{S^{2}}(l,k)),\]

where

(18)

\[\begin{array}{l}\displaystyle {S^{2}}(l,k)=\operatorname{Var}{\mathbf{v}^{T}}(l;k){\mathbf{Z}_{k}}\mathbf{v}(l;k)\\ {} \displaystyle ={\sum \limits_{{i_{1}},{i_{2}},{i_{3}},{i_{4}}=1}^{d}}{v^{{i_{1}}}}(l;k){v^{{i_{2}}}}(l;k){v^{{i_{3}}}}(l;k){v^{{i_{4}}}}(l;k)V({i_{1}},{i_{2}},{i_{3}},{i_{4}};k,k).\end{array}\]

An estimator ${\hat{S}_{;n}^{2}}(l;k)$ for ${S^{2}}(l;k)$ can be obtained by replacing ${v^{i}}(l;k)$ and $V({i_{1}},{i_{2}},{i_{3}},{i_{4}};k,k)$ by their consistent estimators ${\hat{v}_{;n}^{i}}(l;k)$ and ${\hat{V}_{;n}}({i_{1}},{i_{2}},{i_{3}},{i_{4}};k,k)$. It is obvious that ${\hat{S}_{;n}^{2}}(l;k)$ is consistent under the assumptions of Theorem 3.

So, if ${S^{2}}(l,k)>0$, then

\[ \frac{\sqrt{n}}{{\hat{S}_{;n}}(l;k)}({\hat{\lambda }_{;n}}(l;k)-\lambda (l;k))\stackrel{\text{W}}{\longrightarrow }N(0,1).\]

Let ${x_{\alpha /2}}$ be the standard normal quantile of level $1-\alpha /2$,

\[ {\lambda _{;n}^{\pm }}(l;k)={\hat{\lambda }_{;n}}(l;k)\pm {x_{\alpha /2}}\frac{{\hat{S}_{;n}}(l;k)}{\sqrt{n}}.\]

Then

\[ \underset{n\to \infty }{\lim }\operatorname{\mathsf{P}}\{\lambda (l;k)\in [{\lambda _{;n}^{-}}(l;k),{\lambda _{;n}^{+}}(l;k)]\}=1-\alpha .\]

I.e. $[{\lambda _{;n}^{-}}(l;k),{\lambda _{;n}^{+}}(l;k)]$ is an asymptotic confidence interval for $\lambda (l;k)$ with the significance level α if the assumptions of Theorem 3 hold and ${S^{2}}>0$. The last assumption is not too restrictive. In the Appendix we present a simple condition under which it holds.

7 Results of simulations

To evaluate the finite-sample behavior of the proposed technique, we performed a small simulation study. Confidence intervals for the largest eigenvalue $\lambda (1;k)$ with the nominal significance level $\alpha =0.05$ were calculated on data simulated from the tree-component MVC model. In each experiment there were $B=1000$ simulations for each sample size $n=250,500,\dots ,10000$.

For each mixture component we present the coverage frequency of the intervals, i.e. the number of confidence intervals which covered the true $\lambda (1;k)$ divided by B.

In all the experiments the concentrations ${\mathbf{p}_{j;n}}=({p_{j;n}^{1}},{p_{j;n}^{2}},{p_{j;n}^{3}})$ were generated as independent vectors uniformly distributed on the simplex $\{\mathbf{p}:{p^{m}}\ge 0,m=1,\dots ,3,{p^{1}}+{p^{2}}+{p^{3}}=1\}$.

The observations ${\mathbf{X}_{j}}=({X_{j}^{1}},{X_{j}^{2}},{X_{j}^{3}})$ were three-dimensional.

In the first experiment the distribution of ${\mathbf{X}_{j}}$ was Gaussian in each mixture component ${F_{m}}\sim N({\boldsymbol{\mu }_{m}},{\mathbf{C}_{m}})$, where

(19)

\[ {\boldsymbol{\mu }_{1}}=\left(\begin{array}{c}1\\ {} 0\\ {} 2\end{array}\right),\hspace{2.5pt}{\boldsymbol{\mu }_{2}}=\left(\begin{array}{c}0\\ {} 0\\ {} 0\end{array}\right),\hspace{2.5pt}{\boldsymbol{\mu }_{3}}=\left(\begin{array}{c}1\\ {} 2\\ {} 3\end{array}\right)\]

and

(20)

\[ {\mathbf{C}_{1}}=\left(\begin{array}{c@{\hskip10.0pt}c@{\hskip10.0pt}c}1& -0.5& 0.1\\ {} -0.5& 2& 0.4\\ {} 0.1& 0.4& 3\end{array}\right),\hspace{2.5pt}{\mathbf{C}_{2}}=\left(\begin{array}{c@{\hskip10.0pt}c@{\hskip10.0pt}c}2& 0& 0\\ {} 0& 1& 0\\ {} 0& 0& 0.5\end{array}\right),\hspace{2.5pt}{\mathbf{C}_{3}}=\left(\begin{array}{c@{\hskip10.0pt}c@{\hskip10.0pt}c}5& 1& 1\\ {} 1& 2& 1\\ {} 1& 1& 0.5\end{array}\right).\]

The simulation results are presented on Table 1. It seems that the coverage probabilities of obtained confidence intervals are satisfactory for practical purposes if the sample size n is greater then 1000.

Table 1.

Coverage probabilities of the first experiment

	Components
n	first	second	third
250	0.973	0.872	0.956
500	0.969	0.925	0.952
1000	0.962	0.968	0.953
2500	0.952	0.966	0.951
5000	0.948	0.941	0.956
10000	0.955	0.960	0.953

In the second experiment we used the same parameters for the model as in the first one, except the covariance of the second component. Here we put ${\mathbf{C}_{2}}=\mathbb{E}$ (unit matrix). Surely, this model does not satisfy the assumptions of Theorem 3, since the first eigenvalue of ${\mathbf{C}_{2}}$ is not simple. The results are presented on Table 2. It seems that the confidence interval for the second component’s largest eigenvalue is unsatisfactory even for $n=10000$. The intervals for the first and third components preform satisfactory for n larger then 1000.

Table 2.

Coverage probabilities of the second experiment

	Components
n	first	second	third
250	0.971	0.839	0.959
500	0.961	0.879	0.964
1000	0.964	0.919	0.955
2500	0.954	0.925	0.945
5000	0.943	0.932	0.945
10000	0.950	0.921	0.947

8 Discussion

We proposed a technique for estimation of PC directions and eigenvalues by observations from MVC. Asymptotic normality of the estimators is proved. This opens possibilities for constructing confidence sets and testing hypotheses on PC structure of different mixture components. Results of simulations confirm applicability of the asymptotic results for samples of moderate size.

Now let us discuss some challenges which were not answered in this study.

1. To apply Theorem 3 for statistical analysis of real life data, one needs to be sure that the assumption (iv) of this theorem holds. Is it possible to verify the hypotheses A: all the eigenvalues of ${\mathbf{C}_{k}}$ are simple by some statistical test? Note, that in this test A must be an alternative to the null ${H_{0}}$: there exists an eigenvalue of ${\mathbf{C}_{k}}$ of degree higher then 1. Of course, such test cannot be based on the confidence sets derived by Theorem 3, since it does not hold under ${H_{0}}$. A possible alternative is to use a bootstrap technique for such testing.

2. It is sometimes useful in cluster analysis applications to consider FMMs with growing number of components (clusters) as the sample size n tends to infinity [5]. Similar approach is also beneficial in signal processing [3]. In this case one expects nonparametric convergence rates in theorems similar to Theorem 3. Another generalizations of Theorem 3 are possible if the dimension of the observations space $d\to \infty $ as $n\to \infty $.

3. There are many alternatives to PCA as a dimension reduction technique, e.g., Projection Pursuit (PP) or Independent Components Analysis [6]. Some of them, such as the PP based on the maximization of kurtosis can be modified for application to MVC data similarly to the PCA modification considered in this study. It would be interesting to analyze efficiency of these modifications both theoretically and in real life data analysis.

We hope that further study will clarify answers on these questions.

Authors

Abstract

1 Introduction

2 Classical principal component analysis

3 Mixtures with varying concentrations

(1)

(2)

(3)

4 Estimation of covariance matrices

(4)

(5)

(6)

Theorem 1.

(7)

Proof.

(8)

(9)

(10)

Theorem 2.

Proof.

(11)

(12)

5 Principal components for mixtures

(13)

Theorem 3.

(14)

(15)

Remark.

Proof.

(16)

(17)

6 Confidence intervals for eigenvalues

(18)

7 Results of simulations

(19)

(20)

Table 1.

Table 2.

8 Discussion

A Appendix

Theorem 4.

Proof.

Acknowledgement

References

Export citation

Copy and paste formatted citation

Download citation in file