1 Introduction
In medicine and other branches of science one frequently faces binarization problems, i.e. one wants to predict to which of two classes an item belongs based on characteristics or data of an individual item. To solve this kind of problem, many competing algorithms are available, such as neural nets, random forests, or estimators for logistic regression models. Despite all differences in detail, the general approach of these algorithms is similar. First, a classification model is fit using training data for which the true classification is known. Then this model can be used to classify new data for which the true classification is unknown. In order to evaluate the quality of a classification model, it is often applied to a test dataset for which the true classification is known, and the true classification is compared with the prediction of the model. A frequently used method to quantify the result of this comparison is the area under the curve (AUC), which we introduce in detail in Section 2.
There are several competing confidence intervals for the AUC. Most of them, like the popular DeLong algorithm [4], those by Kottas, Kuss & Zapf [6] and LeDell, Petersen & v. d. Laan [8], and all intervals compared in Qin & Hotilovac [12], assume that the numbers of members of the case and control groups are deterministic. However, in practice often only the total number of members of the test group can be controlled, while the assignment to the case and control groups is random. Using the theory of U-statistics, we propose new confidence intervals that take this into account.
Moreover, the confidence intervals cited above are designed for the evaluation of the (fitted) classification model. It is assumed that this model is perfectly true and that the randomness comes only from the fact that the test set is a random sample. This is appropriate if one wants to assess the quality of the fitted model. However, often the real question is whether an algorithm is suitable for a certain kind of data. Then the uncertainty that arises from fitting the model parameters on a finite training set has to be considered as well. We will examine this for the logistic regression model. We will see in a simulation study that this uncertainty is of practical relevance and that ignoring it can lead to a seriously too low coverage probability of the confidence intervals. On the theoretical side, however, we will see that this uncertainty is asymptotically negligible.
A related question is what happens when all observations are used both for training and for testing; this situation was considered in [10].
We will take this opportunity to show that two well-known confidence intervals for the AUC, namely DeLong’s intervals [4] and the Mann–Whitney intervals due to Sen [13], coincide. In the literature they are usually quoted under either name without noticing that they are the same, and in Qin & Hotilovac [12] they are even compared against each other.
This paper is organized as follows. In Section 2 we introduce the AUC and the logistic regression model in full detail. In Section 3 we derive the form of the confidence intervals from central limit theorems for the AUC. Section 4 is devoted to a simulation study, and in Section 5 we apply the new confidence intervals to electrocardiogram (ECG) data. In Section 6 we discuss our results and point out directions for future research, and in Section 7 we show that the Mann–Whitney intervals and DeLong’s intervals coincide. The proofs of the theoretical results are postponed to the Appendix.
2 Preliminaries
In this section we introduce the preliminaries we need on logistic regression models and the AUC.
A logistic regression model is a family of probability distributions on ${\mathbb{R}^{p}}\times \{0,1\}$ indexed by a parameter $\beta \in {\mathbb{R}^{p}}$. The random vectors $(X,I)$ of a logistic regression model fulfill
\[ \mathbb{P}(I=1\mid X)=\frac{1}{1+\exp \{-{\beta ^{T}}X\}}.\]
We assume that the observations $({X_{1,1}},{I_{1,1}}),\dots ,({X_{1,m}},{I_{1,m}})$ form an independent, identically distributed (i.i.d.) sample in which each member follows the logistic regression model. So we do not assume that the design points ${X_{1,1}},\dots ,{X_{1,m}}$ are aligned on a grid, but we suppose that they are irregularly scattered. These data points form the training set. Logistic regression models are well studied in the literature; in particular the maximum-likelihood estimator $\hat{\beta }$ for β is known (see, e.g., [5]). It holds that $\hat{\beta }-\beta \to \mathcal{N}(0,{F^{-1}}(\beta ))$ in distribution as $m\to \infty $, where $F(\beta )$ is the Fisher information matrix. A consistent estimator for $F(\beta )$ is given by
\[ H(\beta ):={\sum \limits_{i=1}^{m}}{X_{1,i}}{X_{1,i}^{T}}\pi \big({\beta ^{T}}{X_{1,i}}\big)\cdot \big(1-\pi \big({\beta ^{T}}{X_{1,i}}\big)\big)\]
with $\pi (t)=1/(\exp \{t\}+1)$ (see [5, pp. 200–203]). Once the model is fit, i.e. the parameter β is estimated, the probability ${Y_{2,i}}:=\mathbb{P}({I_{2,i}}=1\mid {X_{2,i}})$, $i=1,\dots ,n$, can be estimated by ${\hat{Y}_{2,i}}$ for new data points ${X_{2,1}},\dots ,{X_{2,n}}$ that form the test set. In order to get a prediction for the class ${I_{2,i}}$, one can choose a threshold $c\in (0,1)$ and put ${\hat{I}_{2,i}}:={\mathbf{1}_{[c,1]}}({\hat{Y}_{2,i}})$.
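As an illustration of this two-step workflow (fitting on the training set, then scoring and thresholding the test set), the following minimal sketch uses scikit-learn; the data, the variable names and the threshold are hypothetical placeholders, and the plug-in computation of $H(\beta)$ merely mirrors the formula displayed above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical training and test data: rows are observations, columns are the p features.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(200, 5)), rng.normal(size=(100, 5))
I_train = rng.binomial(1, 1 / (1 + np.exp(-X_train[:, 0])))   # true classes of the training set

# Fit by (approximate) maximum likelihood; a very large C effectively switches off
# scikit-learn's default regularization, and fit_intercept=False matches the model above.
model = LogisticRegression(C=1e6, fit_intercept=False).fit(X_train, I_train)
beta_hat = model.coef_.ravel()

# Plug-in version of the matrix H(beta): sum_i x_i x_i^T * p_i * (1 - p_i),
# where p_i is the estimated probability of {I = 1} for the i-th training point.
p = model.predict_proba(X_train)[:, 1]
H = (X_train * (p * (1 - p))[:, None]).T @ X_train

# Scores for the test set and thresholded class predictions.
Y_hat = model.predict_proba(X_test)[:, 1]   # estimates of Y_{2,i} = P(I_{2,i} = 1 | X_{2,i})
c = 0.5                                     # some threshold in (0, 1)
I_hat = (Y_hat >= c).astype(int)
```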
Notice that the behavior we saw above for the logistic regression model is typical for classification algorithms. At first, some $[0,1]$-valued score ${\hat{Y}_{2,i}}$ is derived: for logistic regression models this is the estimated probability of $\{{I_{2,i}}=1\}$, for neural nets it is the value of the nodes in the output layer, and for random forests it is the ratio of trees predicting $\{{I_{2,i}}=1\}$. Then the predicted classification is obtained by thresholding ${\hat{Y}_{2,i}}$.
Now the prediction quality of the logistic regression model, or of any other classification algorithm that works as indicated in the last paragraph, can be assessed using the area under the receiver operating characteristic curve (AUC). The empirical AUC is
\[ \hat{A}=\frac{{\textstyle\textstyle\sum _{i=1}^{n}}{\textstyle\textstyle\sum _{j=1}^{n}}{\mathbf{1}_{\{{\hat{Y}_{2,i}}\lt {\hat{Y}_{2,j}}\}}}{\mathbf{1}_{\{{I_{2,i}}=0\}}}{\mathbf{1}_{\{{I_{2,j}}=1\}}}}{({\textstyle\textstyle\sum _{i=1}^{n}}{\mathbf{1}_{\{{I_{2,i}}=0\}}})\cdot ({\textstyle\textstyle\sum _{j=1}^{n}}{\mathbf{1}_{\{{I_{2,j}}=1\}}})}+\frac{1}{2}\cdot \frac{{\textstyle\textstyle\sum _{i=1}^{n}}{\textstyle\textstyle\sum _{j=1}^{n}}{\mathbf{1}_{\{{\hat{Y}_{2,i}}={\hat{Y}_{2,j}}\}}}{\mathbf{1}_{\{{I_{2,i}}=0\}}}{\mathbf{1}_{\{{I_{2,j}}=1\}}}}{({\textstyle\textstyle\sum _{i=1}^{n}}{\mathbf{1}_{\{{I_{2,i}}=0\}}})\cdot ({\textstyle\textstyle\sum _{j=1}^{n}}{\mathbf{1}_{\{{I_{2,j}}=1\}}})}.\]
Its theoretical counterpart is
\[ A=\mathbb{P}({Y_{1}}\lt {Y_{2}}\mid {I_{1}}=0,{I_{2}}=1)+\frac{1}{2}\cdot \mathbb{P}({Y_{1}}={Y_{2}}\mid {I_{1}}=0,{I_{2}}=1).\]
The name of the AUC comes from the fact that it equals a certain area; see Figure 1. For further information on the AUC, see [11].
Fig. 1. Interpretation of the AUC as an area. If the ROC curve (see [11]) is the thick black line, then the AUC is the area of the gray polygon.
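For concreteness, the empirical AUC defined above can be computed directly from the scores and labels as in the following sketch; this is a plain transcription of the formula (pairwise comparisons with ties counted with weight 1/2), not an optimized implementation, and the example data are hypothetical.

```python
import numpy as np

def empirical_auc(scores, labels):
    """Empirical AUC: fraction of (control, case) pairs in which the control score lies
    below the case score, with ties counted with weight 1/2."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    controls, cases = scores[labels == 0], scores[labels == 1]
    less = (controls[:, None] < cases[None, :]).sum()    # pairwise "strictly below" comparisons
    ties = (controls[:, None] == cases[None, :]).sum()   # pairwise ties
    return (less + 0.5 * ties) / (len(controls) * len(cases))

# Example: half controls, half cases, case scores shifted upwards by 1.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 50)
scores = rng.normal(size=100) + labels
print(empirical_auc(scores, labels))   # roughly Phi(1 / sqrt(2)), i.e. about 0.76
```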
3 Theoretical results
In this section, we present the mathematical theorems that are needed in order to establish the new confidence intervals.
We consider two models.
Model 1.
Let $({\hat{Y}_{2,i}},{I_{2,i}})=({Y_{2,i}},{I_{2,i}})$, $i=1,\dots ,n$, be an i.i.d. sample of a real-valued variable Y and a binary variable I.
Model 2.
Let ${\hat{Y}_{2,i}}$, $i=1,\dots ,n$, be the fitted values of a logistic regression model (see Section 2) with true parameter ${\beta _{0}}\in {\mathbb{R}^{p}}$, applied to i.i.d. data points ${X_{2,i}}$, $i=1,\dots ,n$, and let ${I_{2,i}}$, $i=1,\dots ,n$, be the known true classification. We shall assume that the distribution of the points ${X_{2,i}}$, $i=1,\dots ,n$, is absolutely continuous with Lebesgue density q. For two points ${X_{2,i}}$ and ${X_{2,j}}$, $j\ne i$, the distribution of $({X_{2,i}}-{X_{2,j}})/\| {X_{2,i}}-{X_{2,j}}\| $ has a bounded density with respect to the ($p-1$)-dimensional Hausdorff measure on the unit sphere. Moreover, the cardinality m of the training set and the cardinality n of the test set should fulfill
Model 1 is the classical model used most frequently in the literature so far (see DeLong et al. [4], Kottas et al. [6], Qin & Hotilovac [12] and Sen [13]). The idea is that ${\hat{Y}_{2,i}}$, $i=1,\dots ,n$, are fitted values obtained by applying a completely known model to an i.i.d. sample of data points. As a consequence of this simplification ${\hat{Y}_{2,i}}$ and ${Y_{2,i}}$ always coincide under Model 1 and the “real” value of ${Y_{2,i}}$ is ignored. Notice that the first index 2 of the observations is not necessary if one only considers Model 1, since then there are no training observations $({X_{1,i}},{I_{1,i}})$. We just add this index in order to be able to treat Model 1 and Model 2 jointly.
Model 2 takes the more realistic point of view that the classification model is disturbed by random effects that arose in the model fitting procedure. However, under Model 2 we require that the classification model used is the logistic regression model, while under Model 1 we make no assumptions on the classification model.
We put
\[ {\sigma _{A}^{2}}:={v^{T}}\Sigma v,\]
where Σ is the asymptotic covariance matrix of
\[ \left(\begin{array}{c}{\textstyle\textstyle\sum _{i=1}^{n}}{\textstyle\textstyle\sum _{j=1}^{n}}{\mathbf{1}_{\{{Y_{2,i}}\lt {Y_{2,j}}\}}}{\mathbf{1}_{\{{I_{2,i}}=0\}}}{\mathbf{1}_{\{{I_{2,j}}=1\}}}+\frac{1}{2}\cdot {\textstyle\textstyle\sum _{i=1}^{n}}{\textstyle\textstyle\sum _{j=1}^{n}}{\mathbf{1}_{\{{Y_{2,i}}={Y_{2,j}}\}}}{\mathbf{1}_{\{{I_{2,i}}=0\}}}{\mathbf{1}_{\{{I_{2,j}}=1\}}}\\ {} {\textstyle\textstyle\sum _{i=1}^{n}}{\mathbf{1}_{\{{I_{2,i}}=0\}}}\\ {} {\textstyle\textstyle\sum _{j=1}^{n}}{\mathbf{1}_{\{{I_{2,j}}=1\}}}\end{array}\right)\]
and where
\[ v=\left(\begin{array}{c}\frac{1}{\mathbb{P}({I_{1}}=0)\cdot \mathbb{P}({I_{1}}=1)}\\ {} -\frac{\mathbb{P}({Y_{1}}\lt {Y_{2}},{I_{1}}=0,{I_{2}}=1)}{\mathbb{P}{({I_{1}}=0)^{2}}\cdot \mathbb{P}({I_{2}}=1)}-\frac{1}{2}\cdot \frac{\mathbb{P}({Y_{1}}={Y_{2}},{I_{1}}=0,{I_{2}}=1)}{\mathbb{P}{({I_{1}}=0)^{2}}\cdot \mathbb{P}({I_{2}}=1)}\\ {} -\frac{\mathbb{P}({Y_{1}}\lt {Y_{2}},{I_{1}}=0,{I_{2}}=1)}{\mathbb{P}({I_{1}}=0)\cdot \mathbb{P}{({I_{2}}=1)^{2}}}-\frac{1}{2}\cdot \frac{\mathbb{P}({Y_{1}}={Y_{2}},{I_{1}}=0,{I_{2}}=1)}{\mathbb{P}({I_{1}}=0)\cdot \mathbb{P}{({I_{2}}=1)^{2}}}\end{array}\right),\]
with $({Y_{1}},{I_{1}},{Y_{2}},{I_{2}})$ having the same distribution as $({Y_{2,i}},{I_{2,i}},{Y_{2,j}},{I_{2,j}})$, $j\ne i$. In the Appendix, we will show that ${\sigma _{A}^{2}}$ is the asymptotic variance of $\hat{A}$. Let ${S_{A}^{2}}$ be the plug-in estimator for ${\sigma _{A}^{2}}$, where all probabilities involved in the definition of v are estimated by their corresponding relative frequencies and where
\[\begin{aligned}{}\hat{\Sigma }& =\frac{1}{n\cdot (n-1)\cdot (n-2)}{\sum \limits_{i,j,k=1}^{n}}\left(\begin{array}{c}{a_{ij}}\\ {} {\mathbf{1}_{\{{I_{2,i}}=0\}}}+{\mathbf{1}_{\{{I_{2,j}}=0\}}}\\ {} {\mathbf{1}_{\{{I_{2,i}}=1\}}}+{\mathbf{1}_{\{{I_{2,j}}=1\}}}\end{array}\right)\\ {} & \hspace{1em}\times \left(\begin{array}{c@{\hskip10.0pt}c@{\hskip10.0pt}c}{a_{ik}}& {\mathbf{1}_{\{{I_{2,i}}=0\}}}+{\mathbf{1}_{\{{I_{2,k}}=0\}}}& {\mathbf{1}_{\{{I_{2,i}}=1\}}}+{\mathbf{1}_{\{{I_{2,k}}=1\}}}\end{array}\right)\\ {} & \hspace{1em}-\left(\frac{1}{n\cdot (n-1)}{\sum \limits_{i,j=1}^{n}}\left(\begin{array}{c}{a_{ij}}\\ {} {\mathbf{1}_{\{{I_{2,i}}=0\}}}+{\mathbf{1}_{\{{I_{2,j}}=0\}}}\\ {} {\mathbf{1}_{\{{I_{2,i}}=1\}}}+{\mathbf{1}_{\{{I_{2,j}}=1\}}}\end{array}\right)\right)\\ {} & \hspace{1em}\times \left(\frac{1}{n\cdot (n-1)}{\sum \limits_{i,j=1}^{n}}\left(\begin{array}{c@{\hskip10.0pt}c@{\hskip10.0pt}c}{a_{ij}}& {\mathbf{1}_{\{{I_{2,i}}=0\}}}+{\mathbf{1}_{\{{I_{2,j}}=0\}}}& {\mathbf{1}_{\{{I_{2,i}}=1\}}}+{\mathbf{1}_{\{{I_{2,j}}=1\}}}\end{array}\right)\right)\end{aligned}\]
with
\[\begin{aligned}{}{a_{ij}}& ={\mathbf{1}_{\{{\hat{Y}_{2,i}}\lt {\hat{Y}_{2,j}}\}}}\cdot {\mathbf{1}_{\{{I_{2,i}}=0\}}}\cdot {\mathbf{1}_{\{{I_{2,j}}=1\}}}+\frac{1}{2}\cdot {\mathbf{1}_{\{{\hat{Y}_{2,i}}={\hat{Y}_{2,j}}\}}}\cdot {\mathbf{1}_{\{{I_{2,i}}=0\}}}\cdot {\mathbf{1}_{\{{I_{2,j}}=1\}}}\\ {} & \hspace{1em}+{\mathbf{1}_{\{{\hat{Y}_{2,j}}\lt {\hat{Y}_{2,i}}\}}}\cdot {\mathbf{1}_{\{{I_{2,j}}=0\}}}\cdot {\mathbf{1}_{\{{I_{2,i}}=1\}}}+\frac{1}{2}\cdot {\mathbf{1}_{\{{\hat{Y}_{2,j}}={\hat{Y}_{2,i}}\}}}\cdot {\mathbf{1}_{\{{I_{2,j}}=0\}}}\cdot {\mathbf{1}_{\{{I_{2,i}}=1\}}}\end{aligned}\]
is the estimator for Σ. The consistency of ${S_{A}^{2}}$ will be established in the course of the proof of Theorem 1. We remark that the calculation of $\hat{\Sigma }$ is indeed not $O({n^{3}})$, but $O({n^{2}})$, because
\[\begin{aligned}{}& {\sum \limits_{i,j,k=1}^{n}}\left(\begin{array}{c}{a_{ij}}\\ {} {\mathbf{1}_{\{{I_{2,i}}=0\}}}+{\mathbf{1}_{\{{I_{2,j}}=0\}}}\\ {} {\mathbf{1}_{\{{I_{2,i}}=1\}}}+{\mathbf{1}_{\{{I_{2,j}}=1\}}}\end{array}\right)\left(\begin{array}{c@{\hskip10.0pt}c@{\hskip10.0pt}c}{a_{ik}}& {\mathbf{1}_{\{{I_{2,i}}=0\}}}+{\mathbf{1}_{\{{I_{2,k}}=0\}}}& {\mathbf{1}_{\{{I_{2,i}}=1\}}}+{\mathbf{1}_{\{{I_{2,k}}=1\}}}\end{array}\right)\\ {} & \hspace{1em}={\sum \limits_{i=1}^{n}}{w_{i}}{w_{i}^{T}},\end{aligned}\]
where
\[ {w_{i}}={\sum \limits_{j=1}^{n}}\left(\begin{array}{c}{a_{ij}}\\ {} {\mathbf{1}_{\{{I_{2,i}}=0\}}}+{\mathbf{1}_{\{{I_{2,j}}=0\}}}\\ {} {\mathbf{1}_{\{{I_{2,i}}=1\}}}+{\mathbf{1}_{\{{I_{2,j}}=1\}}}\end{array}\right),\hspace{1em}i=1,\dots ,n.\]
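The following sketch (with hypothetical array names) transcribes this quadratic-time evaluation of $\hat{\Sigma}$ into NumPy; the triple and double sums are taken over all index combinations exactly as displayed above, so the code is a direct illustration of the formulas rather than an optimized implementation.

```python
import numpy as np

def sigma_hat(Y_hat, I2):
    """O(n^2) evaluation of the estimator for Sigma via the row sums w_i."""
    Y_hat, I2 = np.asarray(Y_hat, float), np.asarray(I2, int)
    n = len(Y_hat)
    c0, c1 = (I2 == 0).astype(float), (I2 == 1).astype(float)
    # a_ij from the definition above; the two halves of the sum are transposes of each other.
    b = ((Y_hat[:, None] < Y_hat[None, :]) + 0.5 * (Y_hat[:, None] == Y_hat[None, :])) * np.outer(c0, c1)
    a = b + b.T
    # w_i = sum_j (a_ij, 1{I_i=0} + 1{I_j=0}, 1{I_i=1} + 1{I_j=1})^T, one 3-vector per i.
    w = np.column_stack([a.sum(axis=1), n * c0 + c0.sum(), n * c1 + c1.sum()])
    first = (w.T @ w) / (n * (n - 1) * (n - 2))   # (1/(n(n-1)(n-2))) * sum_i w_i w_i^T
    mean = w.sum(axis=0) / (n * (n - 1))          # (1/(n(n-1))) * sum over i,j of the 3-vectors
    return first - np.outer(mean, mean)
```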
Theorem 1.
Assume Model 1 or Model 2 with the notation introduced above. Then
\[ \sqrt{n}\cdot \frac{\hat{A}-A}{\sqrt{{S_{A}^{2}}}}\to \mathcal{N}(0,1)\]
in distribution as $n\to \infty $.
The proof of this theorem will be given in Appendix A.
Corollary 1.
Assume Model 1 or Model 2 with the notation introduced above. Let $g:(0,1)\to \mathbb{R}$ be a ${C^{1}}$-function with ${g^{\prime }}(x)\ne 0$ for all $x\in (0,1)$. Then
\[ \sqrt{n}\cdot \frac{g(\hat{A})-g(A)}{{g^{\prime }}(\hat{A})\cdot \sqrt{{S_{A}^{2}}}}\to \mathcal{N}(0,1)\]
in distribution as $n\to \infty $.
The proof of this corollary will be given in Appendix A.
Let ${z_{\alpha }}$ be the α-quantile of the $\mathcal{N}(0,1)$-distribution.
Corollary 2.
Assume Model 1 or Model 2 with the notation introduced above. Then the interval
\[ \bigg(\hat{A}+{z_{\alpha /2}}\cdot \sqrt{\frac{{S_{A}^{2}}}{n}},\hspace{0.2778em}\hat{A}+{z_{1-\alpha /2}}\cdot \sqrt{\frac{{S_{A}^{2}}}{n}}\bigg)\]
has asymptotically the coverage probability $1-\alpha $ for A.
This corollary is immediate from the theorem.
Corollary 3.
Assume Model 1 or Model 2 with the notation introduced above. Let $g:(0,1)\to \mathbb{R}$ be a bijective ${C^{1}}$-function with ${g^{\prime }}(x)\gt 0$ for all $x\in (0,1)$. Then the interval
\[ \bigg({g^{-1}}\bigg(g(\hat{A})+{z_{\alpha /2}}\cdot {g^{\prime }}(\hat{A})\cdot \sqrt{\frac{{S_{A}^{2}}}{n}}\bigg),{g^{-1}}\bigg(g(\hat{A})+{z_{1-\alpha /2}}\cdot {g^{\prime }}(\hat{A})\cdot \sqrt{\frac{{S_{A}^{2}}}{n}}\bigg)\bigg)\]
has asymptotically the coverage probability $1-\alpha $ for A.
This corollary is immediate from Corollary 1.
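For illustration, given the point estimate $\hat{A}$, the variance estimate $S_A^2$ and the test-set size n, the intervals of Corollary 2 and of Corollary 3 (here with the logit function as g, the choice also used in the simulations below) can be evaluated as in the following sketch; the numerical inputs in the example are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def auc_ci(A_hat, S2, n, alpha=0.05, transform=None):
    """Asymptotic (1 - alpha) confidence interval for the AUC.
    transform=None gives the plain interval of Corollary 2; transform='logit'
    gives the transformed interval of Corollary 3 with g = logit."""
    se = np.sqrt(S2 / n)
    z = norm.ppf(1 - alpha / 2)
    if transform is None:
        return A_hat - z * se, A_hat + z * se
    # g(x) = log(x/(1-x)), g'(x) = 1/(x(1-x)), g^{-1}(y) = 1/(1 + exp(-y))
    g = np.log(A_hat / (1 - A_hat))
    g_prime = 1.0 / (A_hat * (1 - A_hat))
    lower, upper = g - z * g_prime * se, g + z * g_prime * se
    return 1 / (1 + np.exp(-lower)), 1 / (1 + np.exp(-upper))

# Example with hypothetical numbers: A_hat = 0.78, S_A^2 = 0.9, n = 200.
print(auc_ci(0.78, 0.9, 200))                      # Corollary 2
print(auc_ci(0.78, 0.9, 200, transform="logit"))   # Corollary 3 with g = logit
```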
4 Simulations
In this section, we compare, based on simulations, the performance of the newly proposed confidence intervals from Corollary 2 and from Corollary 3 (with the logit function as g) with DeLong’s intervals [4, 13] and the Modified Wald intervals [6]. We consider two different scenarios, namely the binormal model, which is classical in the investigation of the AUC, and the fitting of a logistic regression model.
In the binormal model the procedure of fitting the model is ignored and it is assumed that the fitted values of the control observations and the fitted values of the case observations are normally distributed with different means. We assume that the fitted values of the control observations follow the $\mathcal{N}(0,1)$-distribution and the fitted values of the case observations follow the $\mathcal{N}({\mu _{1}},1)$-distribution for ${\mu _{1}}=1$ or ${\mu _{1}}=2$. We use $n=20,200,2000$ observations in the test set, of which one half belongs to the case group and the other half belongs to the control group. For each parameter combination we determined the coverage probability and the mean length of the confidence intervals based on 10,000 simulation runs. The true AUC needed to calculate the coverage probability was determined analytically. The results are reported in Table 1 and Table 2.
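One run of this binormal experiment can be sketched as follows; `ci_method` stands for whichever interval construction is being evaluated (a hypothetical placeholder), the case scores are taken to have mean μ1, and the full study repeats such runs 10,000 times per setting to record coverage and length.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def binormal_run(n, mu1, ci_method, alpha=0.05):
    """One simulation run: draw n/2 case and n/2 control scores, build a confidence
    interval with the supplied method, and check whether it covers the true AUC."""
    cases = rng.normal(loc=mu1, size=n // 2)      # fitted values of the case group
    controls = rng.normal(loc=0.0, size=n // 2)   # fitted values of the control group
    scores = np.concatenate([controls, cases])
    labels = np.concatenate([np.zeros(n // 2, int), np.ones(n // 2, int)])
    true_auc = norm.cdf(mu1 / np.sqrt(2))         # analytic AUC in this binormal setting
    lower, upper = ci_method(scores, labels, alpha)
    return (lower < true_auc < upper), upper - lower

# Hypothetical usage, e.g. with 10,000 repetitions:
# coverage, lengths = zip(*(binormal_run(200, 1.0, some_ci_method) for _ in range(10_000)))
```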
Table 1.
Estimated coverage probabilities in the binormal model. In the first row we give the number n of observations and in the second row we give the expected value ${\mu _{1}}$ of the fitted values of the case observations
n | 20 | 200 | 2000 | 20 | 200 | 2000 |
${\mu _{1}}$ | 1 | 1 | 1 | 2 | 2 | 2 |
Corollary 2 | 0.6154 | 0.9359 | 0.9494 | 0.0038 | 0.8772 | 0.9462 |
Corollary 3 | 0.5999 | 0.9389 | 0.9494 | 0.0000 | 0.8864 | 0.9463 |
DeLong | 0.9026 | 0.9446 | 0.9505 | 0.7910 | 0.9369 | 0.9499 |
Modified Wald | 0.9225 | 0.9543 | 0.9590 | 0.8577 | 0.9709 | 0.9797 |
Table 2.
Mean value of the interval lengths in the binormal model. The further details are the same as for Table 1
n | 20 | 200 | 2000 | 20 | 200 | 2000 |
${\mu _{1}}$ | 1 | 1 | 1 | 2 | 2 | 2 |
Corollary 2 | 0.1911 | 0.1261 | 0.0412 | 0.0126 | 0.0602 | 0.0225 |
Corollary 3 | 0.1859 | 0.1258 | 0.0412 | 0.0125 | 0.0612 | 0.0225 |
DeLong | 0.4280 | 0.1315 | 0.0414 | 0.2208 | 0.0721 | 0.0228 |
Modified Wald | 0.4251 | 0.1365 | 0.0432 | 0.2496 | 0.0858 | 0.0272 |
All confidence intervals have a too low coverage probability at small sample sizes. Not surprisingly, this gets better as the number of observations grows. We see that for a small sample size the new confidence intervals are shorter than the ones reported in the literature, at the price of having a lower coverage probability. For a large sample size there is hardly a difference between the new confidence intervals and DeLong’s confidence intervals.
Now we consider the AUC from fitting a logistic regression model. For these simulations we assumed that $m+n=100,1000,10000$ independent design points are drawn from a multivariate standard normal distribution in ${\mathbb{R}^{p}}$ for $p=10,100$. However, we dropped the combination $m+n=100$ and $p=100$, since then we have more parameters than observations. We let 80% of the observations be training data and 20% be test data; so $n=20,200,2000$ observations are used for testing and the results are comparable to the results for the binormal model. We considered two models for the true class: a logistic regression model with the first unit vector as true parameter and a logistic regression model whose true parameter satisfies
\[ \langle {\beta _{0}},{e_{j}}\rangle =\frac{j-p/2}{\sqrt{{\textstyle\textstyle\sum _{i=1}^{p}}{(i-p/2)^{2}}}},\hspace{1em}j=1,\dots ,p.\]
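A sketch of how this “skew” parameter and the corresponding simulated data can be generated is given below (hypothetical names, standard logistic link as in Section 2); the empirical class frequency illustrates the statement on the case probability made next.

```python
import numpy as np

rng = np.random.default_rng(3)

p, N = 100, 1000                                          # dimension and total sample size m + n
j = np.arange(1, p + 1)
beta0 = (j - p / 2) / np.sqrt(np.sum((j - p / 2) ** 2))   # the "skew" true parameter (unit length)

X = rng.normal(size=(N, p))                 # multivariate standard normal design points
prob_case = 1 / (1 + np.exp(-X @ beta0))    # P(I = 1 | X) under the logistic model
I = rng.binomial(1, prob_case)              # class labels, drawn independently per observation

print(I.mean())                             # close to 0.5: the unconditional case probability
```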
It is easily seen that the unconditional probability that an observation is assigned to the case class is one half; however, unlike for the binormal model, now the assignment is made independently for each observation. For determining the coverage probability one has to decide what the target parameter should be. The approach considered in the literature so far takes
\[ {A_{1}}=\mathbb{P}{\big({\beta ^{T}}{X_{1}}\lt {\beta ^{T}}{X_{2}}\mid {I_{1}}=0,{I_{2}}=1\big)_{|\beta =\hat{\beta }}}+\frac{1}{2}\cdot \mathbb{P}{\big({\beta ^{T}}{X_{1}}={\beta ^{T}}{X_{2}}\mid {I_{1}}=0,{I_{2}}=1\big)_{|\beta =\hat{\beta }}},\]
which is appropriate if one is interested in the quality of the fitted model. If one is interested in the quality of the classification algorithm, it makes more sense to consider
\[ {A_{2}}=\mathbb{P}\big({\beta _{0}^{T}}{X_{1}}\lt {\beta _{0}^{T}}{X_{2}}\mid {I_{1}}=0,{I_{2}}=1\big)+\frac{1}{2}\cdot \mathbb{P}\big({\beta _{0}^{T}}{X_{1}}={\beta _{0}^{T}}{X_{2}}\mid {I_{1}}=0,{I_{2}}=1\big),\]
where ${\beta _{0}}$ is the true parameter. For each parameter combination we determined the coverage probability for ${A_{1}}$, the coverage probability for ${A_{2}}$ and the mean length of the confidence intervals based on 10,000 simulation runs. For simulating the true value of ${A_{2}}$ we generated a sample of ${10^{8}}$ observations and for simulating the true value of ${A_{1}}$ we generated in each simulation run a sample of ${10^{6}}$ observations. The results are reported in Tables 3–8.
Table 3.
Estimated coverage probabilities for ${A_{1}}$ in the logistic regression model with the first unit vector as true parameter. In the first row we give the dimension p and in the second row we present the number $m+n$ of observations
p | 10 | 10 | 10 | 100 | 100 |
$m+n$ | 100 | 1000 | 10000 | 1000 | 10000 |
Corollary 2 | 0.727 | 0.937 | 0.948 | 0.939 | 0.951 |
Corollary 3 | 0.737 | 0.943 | 0.948 | 0.946 | 0.951 |
DeLong | 0.915 | 0.946 | 0.949 | 0.946 | 0.952 |
Modified Wald | 0.912 | 0.953 | 0.955 | 0.951 | 0.960 |
Table 4.
Estimated coverage probabilities for ${A_{2}}$ in the logistic regression model with the first unit vector as true parameter. The further details are the same as in Table 3
p | 10 | 10 | 10 | 100 | 100 |
$m+n$ | 100 | 1000 | 10000 | 1000 | 10000 |
Corollary 2 | 0.718 | 0.936 | 0.949 | 0.665 | 0.903 |
Corollary 3 | 0.691 | 0.935 | 0.950 | 0.616 | 0.894 |
DeLong | 0.919 | 0.946 | 0.950 | 0.684 | 0.904 |
Modified Wald | 0.909 | 0.950 | 0.957 | 0.689 | 0.915 |
Table 5.
Mean value of the interval length for the logistic regression model with the first unit vector as true parameter. The further details are the same as in Table 3
p | 10 | 10 | 10 | 100 | 100 |
$m+n$ | 100 | 1000 | 10000 | 1000 | 10000 |
Corollary 2 | 0.2926 | 0.1329 | 0.0428 | 0.1428 | 0.0432 |
Corollary 3 | 0.2806 | 0.1325 | 0.0428 | 0.1421 | 0.0432 |
DeLong | 0.4921 | 0.1379 | 0.0429 | 0.1473 | 0.0434 |
Modified Wald | 0.4646 | 0.1415 | 0.0445 | 0.1490 | 0.0448 |
Table 6.
Coverage probability of ${A_{1}}$ for the logistic regression model with the “skew” true parameter. The further details are the same as in Table 3
p | 10 | 10 | 10 | 100 | 100 |
$m+n$ | 100 | 1000 | 10000 | 1000 | 10000 |
Corollary 2 | 0.724 | 0.937 | 0.948 | 0.939 | 0.948 |
Corollary 3 | 0.734 | 0.944 | 0.949 | 0.945 | 0.950 |
DeLong | 0.907 | 0.947 | 0.949 | 0.947 | 0.950 |
Modified Wald | 0.905 | 0.955 | 0.957 | 0.951 | 0.957 |
Table 7.
Coverage probability of ${A_{2}}$ for the logistic regression model with the “skew” true parameter. The further details are the same as in Table 3
p | 10 | 10 | 10 | 100 | 100 |
$m+n$ | 100 | 1000 | 10000 | 1000 | 10000 |
Corollary 2 | 0.716 | 0.940 | 0.949 | 0.661 | 0.897 |
Corollary 3 | 0.686 | 0.938 | 0.949 | 0.616 | 0.890 |
DeLong | 0.912 | 0.948 | 0.950 | 0.681 | 0.898 |
Modified Wald | 0.903 | 0.953 | 0.958 | 0.687 | 0.910 |
Table 8.
Mean length for the logistic regression model with the “skew” true parameter. The further details are the same as in Table 3
p | 10 | 10 | 10 | 100 | 100 |
$m+n$ | 100 | 1000 | 10000 | 1000 | 10000 |
Corollary 2 | 0.2918 | 0.1329 | 0.0428 | 0.1428 | 0.0432 |
Corollary 3 | 0.2800 | 0.1324 | 0.0428 | 0.1420 | 0.0432 |
DeLong | 0.4903 | 0.1379 | 0.0429 | 0.1472 | 0.0434 |
Modified Wald | 0.4638 | 0.1415 | 0.0445 | 0.1490 | 0.0448 |
We obtain essentially the same results as for the binormal model. All confidence intervals have a too low coverage probability at small sample sizes, but this gets better as the number of observations grows. At small sample sizes the new confidence intervals have a lower coverage probability and a shorter length than DeLong’s intervals or the modified Wald intervals, while at large sample sizes there is not much difference between the intervals. Moreover, we see that when ${A_{2}}$ is the target, we face a curse of dimensionality, i.e. the coverage probability drops at high dimensions. In particular, it is seriously too low for $p=100$ and $m+n=1000$ and a bit too low for $p=100$ and $m+n=10,000$. The results for the first unit vector as ${\beta _{0}}$ are quite similar to the results for the “skew” vector ${\beta _{0}}$. This is not surprising, since the first unit vector is mapped by a rotation onto the “skew” vector and both the logistic regression model and the distribution of the design points are invariant under rotations.
What can be done against the curse of dimensionality? Dimension reduction techniques like the LASSO have the potential to mitigate the problem. In Tables 9–14 we report the results of LASSO logistic regression with $\lambda =0.05$. The logistic regression with LASSO is no longer rotation invariant. Indeed, when the true parameter ${\beta _{0}}$ is the first unit vector, LASSO can be expected to work quite well, since we have one quite large entry and many zero entries. Under this easy parameter setting, LASSO provides a satisfactory solution. For the “skew” true parameter, LASSO can be expected to have problems, since there are many entries which are close to zero, but nonzero. Under this difficult parameter setting, the results with LASSO are even worse than the results without LASSO.
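An L1-penalized (LASSO) logistic regression of this kind can be fitted, for instance, with scikit-learn as sketched below; the value of the inverse regularization strength C is a hypothetical placeholder, since its correspondence to the penalty weight λ = 0.05 depends on the scaling convention of the software used for the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
p = 100
X_train = rng.normal(size=(800, p))
I_train = rng.binomial(1, 1 / (1 + np.exp(-X_train[:, 0])))   # first unit vector as true parameter

# L1-penalized logistic regression; the 'liblinear' and 'saga' solvers both support the L1 penalty.
# C is the inverse regularization strength (smaller C means stronger shrinkage).
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0, fit_intercept=False)
lasso.fit(X_train, I_train)
print((lasso.coef_.ravel() != 0).sum(), "features selected out of", p)
```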
Table 9.
Coverage probability for ${A_{1}}$ for LASSO logistic regression with the first unit vector as true parameter
p | 100 | 100 |
$m+n$ | 1000 | 10000 |
Corollary 2 | 0.935 | 0.953 |
Corollary 3 | 0.943 | 0.953 |
DeLong | 0.945 | 0.954 |
Modified Wald | 0.954 | 0.961 |
Table 10.
Coverage probability for ${A_{2}}$ for LASSO logistic regression with the first unit vector as true parameter
p | 100 | 100 |
$m+n$ | 1000 | 10000 |
Corollary 2 | 0.935 | 0.953 |
Corollary 3 | 0.943 | 0.954 |
DeLong | 0.945 | 0.954 |
Modified Wald | 0.953 | 0.961 |
Table 11.
Mean interval length for LASSO logistic regression with the first unit vector as true parameter
p | 100 | 100 |
$m+n$ | 1000 | 10000 |
Corollary 2 | 0.1315 | 0.0427 |
Corollary 3 | 0.1311 | 0.0427 |
DeLong | 0.1366 | 0.0429 |
Modified Wald | 0.1404 | 0.0444 |
Table 12.
Coverage probability for ${A_{1}}$ for LASSO logistic regression with “skew” true parameter
p | 100 | 100 |
$m+n$ | 1000 | 10000 |
Corollary 2 | 0.9379 | 0.0227 |
Corollary 3 | 0.9423 | 0.0228 |
DeLong | 0.9437 | 0.0228 |
Modified Wald | 0.9432 | 0.0228 |
Table 13.
Coverage probability for ${A_{2}}$ for LASSO logistic regression with “skew” true parameter
p | 100 | 100 |
$m+n$ | 1000 | 10000 |
Corollary 2 | 0.0055 | 0.0000 |
Corollary 3 | 0.0032 | 0.0000 |
DeLong | 0.0067 | 0.0000 |
Modified Wald | 0.0068 | 0.0000 |
Table 14.
Mean interval length for LASSO logistic regression with “skew” true parameter
p | 100 | 100 |
$m+n$ | 1000 | 10000 |
Corollary 2 | 0.15580 | 0.00124 |
Corollary 3 | 0.15458 | 0.00124 |
DeLong | 0.15938 | 0.00124 |
Modified Wald | 0.15881 | 0.05062 |
It is a natural question whether these confidence intervals can be further improved by bias reduction. In order to assess that, we investigated the bias and the standard deviation under the model assumptions explained above. The results are reported in Table 15 and Table 16. We see that, while the bias in the binormal model and the bias to the target ${A_{1}}$ in the logistic regression model are negligible, there is a considerable bias to ${A_{2}}$ in the logistic regression model.
Table 15.
Bias and standard deviation of the AUC in the binormal model
n | 20 | 200 | 2000 | 20 | 200 | 2000 |
${\mu _{1}}$ | 1 | 1 | 1 | 2 | 2 | 2 |
bias | 6.11e-04 | 9.10e-05 | 1.27e-04 | 2.09e-04 | 7.36e-05 | 5.69e-05 |
standard deviation | 0.10805 | 0.03341 | 0.01050 | 0.06137 | 0.01853 | 0.00578 |
Table 16.
Bias and standard deviation of the AUC in the logistic regression model
p | 10 | 10 | 10 | 100 | 100 |
$m+n$ | 100 | 1000 | 10000 | 1000 | 10000 |
mean of $\hat{A}$ | 0.690 | 0.733 | 0.739 | 0.682 | 0.732 |
A1 | 0.684 | 0.733 | 0.739 | 0.682 | 0.732 |
A2 | 0.74 | 0.74 | 0.74 | 0.74 | 0.74 |
bias to target A1 | 6.22e-03 | 5.29e-05 | 1.57e-05 | 1.15e-04 | 7.12e-05 |
bias to target A2 | 0.049483 | 0.006789 | 0.000745 | 0.057935 | 0.007481 |
standard deviation | 0.1171 | 0.0352 | 0.0110 | 0.0385 | 0.0110 |
5 Real data application
In this section we apply the confidence intervals to medical data.
We want to predict the presence of an obstructive coronary artery disease from ECGs and from seven risk factors (age, sex, systolic blood pressure, LDL, diabetes, smoking status, family history). From the ECGs we extracted 648 features using the MUSE(TM) (General Electric, Boston, US) algorithm, yielding 648 explanatory variables. The seven risk factors lead to eight explanatory variables, since we decided to split the family history into two variables (“present vs. absent or unknown” and “unknown vs. present or absent”). Notice that four of these risk factors are binary and thus, strictly speaking, the assumptions of Model 2 are not fulfilled.
We had data from 283,897 ECGs conducted at the University Hospital of Essen. Since we need to know the true classification, we combined these data with the ECAD registry containing the results of 33,865 coronary angiographies. We found a matching coronary angiography for 13,538 ECGs. The patients to whom these ECGs belong were assigned to the training group with probability 0.6 and to the test group with probability 0.4, independently of each other. This resulted in 8136 coronary angiographies being assigned to the training group and 5402 coronary angiographies being assigned to the test group.
We fitted a logistic regression model based on the training group and calculated the AUC together with 95%-confidence intervals for the prediction of an obstructive coronary artery disease as detected in subsequently performed coronary angiography procedures. When the prediction was based on the ECGs, the AUC was 0.709 for the training group and 0.578 for the test group. For the prediction of an obstructive CAD from the seven risk factors, the AUC was 0.595 for the training group and 0.581 for the test group.
The results are reported in Table 17. Though strictly speaking outside the scope of this article, we added the results for the training group. In order to see how the confidence intervals behave on a smaller sample, we applied our methods to a subsample consisting of 100 coronary angiographies. The results are shown in Table 18.
Table 17.
A comparison of different confidence intervals for the AUC for the diagnosis of an obstructive CAD for the full data of 13,538 coronary angiographies via logistic regression models
Method | ECG | ECG | Risk factors | Risk factors |
| (training data) | (test data) | (training data) | (test data) |
AUC | 0.709 | 0.578 | 0.595 | 0.581 |
Corollary 2 | (0.697; 0.721) | (0.562; 0.594) | (0.582; 0.608) | (0.565; 0.597) |
Corollary 3 | (0.697; 0.721) | (0.561; 0.594) | (0.581; 0.608) | (0.565; 0.597) |
DeLong | (0.697; 0.721) | (0.562; 0.594) | (0.582; 0.608) | (0.565; 0.597) |
Modified Wald | (0.698; 0.721) | (0.563; 0.593) | (0.582; 0.607) | (0.566; 0.596) |
Table 18.
A comparison of different confidence intervals for the AUC for the diagnosis of an obstructive CAD for the reduced data of 100 coronary angiographies via logistic regression models
Method | ECG | ECG | Risk factors | Risk factors |
| (training data) | (test data) | (training data) | (test data) |
AUC | 1 | 0.378 | 0.751 | 0.543 |
Corollary 2 | (1; 1) | (0.229; 0.527) | (0.602; 0.901) | (0.346; 0.739) |
Corollary 3 | (1; 1) | (0.244; 0.534) | (0.576; 0.870) | (0.350; 0.724) |
DeLong | (1; 1) | (0.214; 0.542) | (0.575; 0.927) | (0.330; 0.755) |
Modified Wald | (1; 1) | (0.225; 0.531) | (0.607; 0.896) | (0.386; 0.700) |
Table 19.
Neural nets. The further details are the same as in Table 17
Method | ECG | ECG | Risk factors | Risk factors |
| (training data) | (test data) | (training data) | (test data) |
AUC | 0.725 | 0.587 | 0.635 | 0.622 |
Corollary 2 | (0.713; 0.737) | (0.571; 0.604) | (0.622; 0.648) | (0.606; 0.637) |
Corollary 3 | (0.713; 0.737) | (0.571; 0.604) | (0.622; 0.648) | (0.606; 0.637) |
DeLong | (0.713; 0.737) | (0.571; 0.604) | (0.622; 0.648) | (0.606; 0.637) |
Modified Wald | (0.714; 0.736) | (0.572; 0.602) | (0.623; 0.647) | (0.607; 0.636) |
Table 20.
Random forests. The further details are the same as in Table 17
Method | ECG | ECG | Risk factors | Risk factors |
| (training data) | (test data) | (training data) | (test data) |
AUC | 0.999 | 0.599 | 1 | 0.572 |
Corollary 2 | (0.999; 0.999) | (0.583; 0.615) | (1; 1) | (0.556; 0.588) |
Corollary 3 | (0.999; 0.999) | (0.582; 0.615) | (1; 1) | (0.556; 0.588) |
DeLong | (0.999; 0.999) | (0.583; 0.615) | (1; 1) | (0.556; 0.588) |
Modified Wald | (0.999; 1) | (0.584; 0.614) | (1; 1) | (0.557; 0.587) |
Table 21.
Support vector machines. The further details are the same as in Table 17
Method | ECG | ECG | Risk factors | Risk factors |
| (training data) | (test data) | (training data) | (test data) |
AUC | 0.997 | 0.571 | 0.618 | 0.608 |
Corollary 2 | (0.996; 0.998) | (0.555; 0.588) | (0.605; 0.631) | (0.592; 0.624) |
Corollary 3 | (0.996; 0.998) | (0.555; 0.588) | (0.605; 0.631) | (0.592; 0.624) |
DeLong | (0.996; 0.998) | (0.555; 0.588) | (0.605; 0.631) | (0.592; 0.624) |
Modified Wald | (0.996; 0.999) | (0.556; 0.586) | (0.606; 0.63) | (0.593; 0.623) |
For the whole sample all confidence intervals have approximately the same length—the new intervals have the same length as the ones from the literature and the intervals based on the ECGs have the same length as the ones based on the seven risk factors. Not surprisingly, as we reduce the number of observations, the intervals get longer. In particular, for all 13,538 coronary angiographies the logistic regression model is significantly better than a pure random choice (i.e. an AUC of 0.5), which is no longer true if we use only 100 coronary angiographies. For the subsample the new confidence intervals are slightly narrower than the ones from the literature.
In Tables 19–21 we look at what happens when one uses neural nets, random forests or support vector machines instead of logistic regression models. We see that the confidence intervals are slightly shifted due to the different values of the point estimates, but that they all have approximately the same length as the confidence intervals of the logistic regression model.
In order to evaluate the computation times for the confidence intervals, observe that their computation is a two-step procedure. First, the chosen model estimator is used to calculate the fitted values ${\hat{Y}_{2,i}}$, $i=1,\dots ,n$, and in the second step the confidence intervals are calculated from these numbers. So the total computation time of a confidence interval is the sum of one component which depends on the model estimator, but not on the confidence interval method, and one component which depends on the confidence interval method, but not on the model estimator. The computation times are reported in Table 22 and Table 23. We see that the computation times for the new intervals are longer than for those from the literature, but that the computation of the new confidence intervals is still feasible. In particular, for random forests and support vector machines the difference between the new computation times and the old ones is negligible compared to the time needed for the calculation of the fitted values ${\hat{Y}_{2,i}}$, $i=1,\dots ,n$, anyway.
Table 22.
Computation time (in seconds) for the whole sample (13,538 patients)
Method | ECG | ECG | Risk factors | Risk factors |
| (training data) | (test data) | (training data) | (test data) |
logistic regression | 26.23 | 26.13 | 0.56 | 0.49 |
neural net | 4.50 | 4.68 | 2.52 | 2.09 |
random forest | 164.74 | 165.06 | 3.78 | 3.84 |
support vector machines | 856.41 | 857.25 | 181.66 | 177.82 |
AUC | 0.01 | 0.00 | 0.01 | 0.01 |
Corollary 2 | 71.00 | 34.50 | 74.28 | 38.77 |
Corollary 3 | 73.74 | 35.30 | 81.97 | 38.60 |
DeLong | 0.42 | 0.26 | 0.51 | 0.27 |
Modified Wald | 0.01 | 0.00 | 0.01 | 0.00 |
Table 23.
Computation time (in seconds) for the subsample of 100 patients
Method | ECG | ECG | Risk factors | Risk factors |
| (training data) | (test data) | (training data) | (test data) |
logistic regression | 0.44 | 0.42 | 0.01 | 0.02 |
AUC | 0.00 | 0.00 | 0.00 | 0.00 |
Corollary 2 | 0.13 | 0.01 | 0.01 | 0.01 |
Corollary 3 | 0.16 | 0.01 | 0.01 | 0.02 |
DeLong | 0.04 | 0.00 | 0.00 | 0.00 |
Modified Wald | 0.01 | 0.00 | 0.00 | 0.00 |
6 Discussion and outlook
In this paper we have taken into account two facts that are usually ignored in the study of confidence intervals for the AUC. First, only the total size of the test cohort can be controlled, while its splitting into the case and control groups is random. Second, the fitted binarization model is itself subject to random effects. The first fact led to new confidence intervals that are narrower than the ones in the literature, but have a too low coverage probability at small sample sizes. The second fact did not lead to new confidence intervals, since we saw that the confidence intervals we got from considering the first fact still have asymptotically the correct coverage probability under Model 2. All we had to change was to add further parts to the proofs (the ones that are needed only for Model 2 and not for Model 1). It can be expected that in a similar manner the confidence intervals proposed in the literature have asymptotically the correct coverage probability under Model 2.
Can it be expected for other binarization algorithms as well that the old confidence intervals still have asymptotically the correct coverage probability when the model uncertainty is taken into account? The estimators in linear discriminant analysis and in quadratic discriminant analysis are combinations of standard estimators. Hence central limit theorems for these estimators are easily established and from there on it is straightforward to generalize the results of the present article. For quadratic discriminant analysis a certain challenge will be that the set of all test points $({X_{1}},{X_{2}})\in {\mathbb{R}^{p}}\times {\mathbb{R}^{p}}$ for which ${Y_{1}}\lt {Y_{2}}$, but ${\hat{Y}_{1}}\gt {\hat{Y}_{2}}$, will be more complicated than for logistic regression models or for linear discriminant analysis. For algorithms from machine learning, like neural nets, random forests and support vector machines, a first problem already occurs in the definition of the theoretical AUC. Since there is only an algorithm and no underlying probability model, we cannot define the theoretical AUC as a probability like we have done for logistic regression models. One could define the theoretical AUC as the average of many independent realizations of the empirical AUC or as the limit of the empirical AUC as the sample size tends to infinity (provided one can show that this limit exists). Still with either of these definitions, the proof will be much harder. The set of all test points $({X_{1}},{X_{2}})\in {\mathbb{R}^{p}}\times {\mathbb{R}^{p}}$ for which ${Y_{1}}\lt {Y_{2}}$, but ${\hat{Y}_{1}}\gt {\hat{Y}_{2}}$, will be much more complicated for a machine learning algorithm than it was in our proof. Moreover, we used a central limit theorem for the estimator in a logistic regression model, and central limit theorems are unknown for machine learning algorithms.
While the theoretical results tell us that asymptotically the old confidence intervals work under Model 2 as well, our simulation results show that at small sample sizes these confidence intervals may have a seriously too low coverage probability; recall in particular the results in Table 4 for $p=100$. Hence the construction of new confidence intervals is desirable. A tempting idea is to use the δ-method in the same way as we use it in the proof. However, the derivative in Lemma 3 is zero and hence one ends up with the old confidence intervals when using this approach. A solution would be to use the second-order δ-method (see, e.g., [1, Lemma 5]). However, this approach has its own difficulties. The second-order δ-method yields that the limiting distribution of the AUC is a sum of squares of Gaussian random variables, but since not all Gaussian random variables involved in that sum have the same variance, this sum will not be ${\chi ^{2}}$-distributed in general. It is not clear whether a closed-form expression for the variances of these Gaussian random variables can be derived even in the ideal situation when the design points are multivariate normally distributed or distributed uniformly on the ball. In the realistic situation, when the distribution of the design points is unknown and has to be estimated, it will even be a challenge to propose an algorithm that gives a reasonable approximation for these variances in acceptable time. The results for the LASSO logistic regression ranged from providing a satisfactory solution to being even worse than the pure logistic regression, depending on the unknown true model parameter. The bootstrap [3] is known to have good finite-sample properties in many instances and hence would be another approach worth trying. Finally, our simulations in Table 16 show that the estimator $\hat{A}$ is seriously biased for ${A_{2}}$. Hence one can think of constructing an estimator for the bias of $\hat{A}$ for ${A_{2}}$ and then applying bias reduction.
7 The equality of the Mann–Whitney intervals and DeLong’s intervals
Here we prove that the Mann–Whitney intervals due to Sen [13] coincide with DeLong’s intervals [4]. For any real-valued sample ${a_{1}},{a_{2}},\dots ,{a_{N}}$, let ${a_{(1)}},{a_{(2)}},\dots ,{a_{(N)}}$ denote the ordered sample, i.e. the sample containing the same elements (with the same multiplicity), such that ${a_{(1)}}\le {a_{(2)}}\le \cdots \le {a_{(N)}}$. We let ${n_{0}}:=|\{i\in \{1,\dots ,n\}\mid {I_{i}}=0\}|$ denote the number of observations in the control group and ${n_{1}}:=|\{i\in \{1,\dots ,n\}\mid {I_{i}}=1\}|$ the number of observations in the case group. Since Sen [13] and DeLong et al. [4] do not consider a training group, it goes without saying that only observations of the test group are meant here. We let ${X_{i}}$, $i=1,\dots ,{n_{0}}$, be the observations of the control group (not to be confused with the design points of the logistic regression model, for which we used the same symbol) and ${Y_{j}}$, $j=1,\dots ,{n_{1}}$, the observations of the case group. The Mann–Whitney intervals are defined as follows. We let
\[\begin{aligned}{}& {R_{i}}:=\big|\big\{k\in \{1,\dots ,{n_{0}}\}\mid {X_{k}}\le {X_{(i)}}\big\}\big|+\big|\big\{j\in \{1,\dots ,{n_{1}}\}\mid {Y_{j}}\le {X_{(i)}}\big\}\big|,\hspace{2em}\\ {} & \hspace{1em}i=1,\dots ,{n_{0}},\hspace{2em}\\ {} & {S_{j}}:=\big|\big\{i\in \{1,\dots ,{n_{0}}\}\mid {X_{i}}\le {Y_{(j)}}\big\}\big|+\big|\big\{k\in \{1,\dots ,{n_{1}}\}\mid {Y_{k}}\le {Y_{(j)}}\big\}\big|,\hspace{2em}\\ {} & \hspace{1em}j=1,\dots ,{n_{1}},\hspace{2em}\end{aligned}\]
denote the rank of ${X_{(i)}}$ or ${Y_{(j)}}$ respectively within the joint sample of control and case observations. Put
\[\begin{array}{l}\displaystyle \bar{R}:=\frac{1}{{n_{0}}}{\sum \limits_{i=1}^{{n_{0}}}}{R_{i}},\hspace{2em}\bar{S}:=\frac{1}{{n_{1}}}{\sum \limits_{j=1}^{{n_{1}}}}{S_{j}},\\ {} \displaystyle {S_{10}^{2}}=\frac{1}{({n_{0}}-1)\cdot {n_{1}^{2}}}\cdot \Bigg({\sum \limits_{i=1}^{{n_{0}}}}{({R_{i}}-i)^{2}}-{n_{0}}\cdot {\bigg(\bar{R}-\frac{{n_{0}}+1}{2}\bigg)^{2}}\Bigg),\\ {} \displaystyle {S_{01}^{2}}=\frac{1}{({n_{1}}-1)\cdot {n_{0}^{2}}}\cdot \Bigg({\sum \limits_{j=1}^{{n_{1}}}}{({S_{j}}-j)^{2}}-{n_{1}}\cdot {\bigg(\bar{S}-\frac{{n_{1}}+1}{2}\bigg)^{2}}\Bigg),\\ {} \displaystyle {\hat{\sigma }_{M}^{2}}=\frac{{n_{1}}\cdot {S_{10}^{2}}+{n_{0}}\cdot {S_{01}^{2}}}{{n_{0}}\cdot {n_{1}}}.\end{array}\]
Let ${z_{\alpha }}$ be the α-quantile of the standard normal distribution for $\alpha \in (0,1)$. Then
\[ \Big(\hat{A}+{z_{\alpha /2}}\cdot \sqrt{{\hat{\sigma }_{M}^{2}}},\hat{A}+{z_{1-\alpha /2}}\cdot \sqrt{{\hat{\sigma }_{M}^{2}}}\Big)\]
is the Mann–Whitney confidence interval. In order to define DeLong’s intervals, put
\[\begin{array}{l}\displaystyle {V_{10}}(y):=\frac{1}{{n_{1}}}\cdot {\sum \limits_{j=1}^{{n_{1}}}}\bigg({\mathbf{1}_{\{y\lt {Y_{j}}\}}}+\frac{1}{2}\cdot {\mathbf{1}_{\{y={Y_{j}}\}}}\bigg),\\ {} \displaystyle {V_{01}}(y):=\frac{1}{{n_{0}}}\cdot {\sum \limits_{i=1}^{{n_{0}}}}\bigg({\mathbf{1}_{\{{X_{i}}\lt y\}}}+\frac{1}{2}\cdot {\mathbf{1}_{\{{X_{i}}=y\}}}\bigg),\\ {} \displaystyle {\hat{\sigma }_{D}^{2}}=\frac{1}{{n_{0}}\cdot ({n_{0}}-1)}{\sum \limits_{i=1}^{{n_{0}}}}{\big({V_{10}}({X_{i}})-\hat{A}\big)^{2}}+\frac{1}{{n_{1}}\cdot ({n_{1}}-1)}{\sum \limits_{j=1}^{{n_{1}}}}{\big({V_{01}}({Y_{j}})-\hat{A}\big)^{2}}.\end{array}\]
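As a quick numerical illustration, both variance estimators can be computed side by side as in the following sketch (assuming continuous scores, so that no ties occur); up to floating-point error the two values agree, in line with the equality established below.

```python
import numpy as np
from scipy.stats import rankdata

def mw_and_delong_variances(X, Y):
    """Mann-Whitney (Sen) and DeLong variance estimators for the empirical AUC;
    continuous scores are assumed, so that there are no ties."""
    X, Y = np.sort(np.asarray(X, float)), np.sort(np.asarray(Y, float))
    n0, n1 = len(X), len(Y)
    ranks = rankdata(np.concatenate([X, Y]))   # ranks within the joint sample
    R, S = ranks[:n0], ranks[n0:]              # R_i for X_(i) and S_j for Y_(j)
    S10 = (np.sum((R - np.arange(1, n0 + 1)) ** 2)
           - n0 * (R.mean() - (n0 + 1) / 2) ** 2) / ((n0 - 1) * n1 ** 2)
    S01 = (np.sum((S - np.arange(1, n1 + 1)) ** 2)
           - n1 * (S.mean() - (n1 + 1) / 2) ** 2) / ((n1 - 1) * n0 ** 2)
    sigma2_M = (n1 * S10 + n0 * S01) / (n0 * n1)

    # DeLong's estimator via the placement values V_10(X_i) and V_01(Y_j).
    V10 = (X[:, None] < Y[None, :]).mean(axis=1)
    V01 = (X[:, None] < Y[None, :]).mean(axis=0)
    A_hat = V10.mean()
    sigma2_D = (np.sum((V10 - A_hat) ** 2) / (n0 * (n0 - 1))
                + np.sum((V01 - A_hat) ** 2) / (n1 * (n1 - 1)))
    return sigma2_M, sigma2_D

rng = np.random.default_rng(5)
print(mw_and_delong_variances(rng.normal(size=40), rng.normal(size=60) + 1))
```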
DeLong’s interval is then
\[ \Big(\hat{A}+{z_{\alpha /2}}\cdot \sqrt{{\hat{\sigma }_{D}^{2}}},\hspace{0.2778em}\hat{A}+{z_{1-\alpha /2}}\cdot \sqrt{{\hat{\sigma }_{D}^{2}}}\Big).\]
Theorem 2.
It holds that ${\hat{\sigma }_{M}^{2}}={\hat{\sigma }_{D}^{2}}$, and hence the Mann–Whitney interval and DeLong’s interval coincide.
This theorem will be proven in Appendix B.