1 Introduction
1.1 Background
The beta distribution is ubiquitous in statistics due to its flexibility. Among many applications, it is used in analyses of uniform order statistics [17], problems in Euclidean geometry [10], general theory of stochastic processes [24], and applied statistical inference; the last category of applications includes time estimation in project management [5], hypothesis testing [34], A/B testing in business [28], modelling in life sciences [31] and others [19, 20].
Unfortunately, importance and simplicity do not go hand in hand. The distribution of $X\sim \mathsf{Beta}(\alpha ,\beta )$ with parameters $\alpha ,\beta >0$ is given by
(1)
\[ \mathbf{P}\{X\leqslant \epsilon \}={\int _{0}^{\epsilon }}\frac{{x^{\alpha -1}}{(1-x)^{\beta -1}}}{B(\alpha ,\beta )}\text{d}x,\hspace{1em}0\leqslant \epsilon \leqslant 1,\]
where $B(\alpha ,\beta )\triangleq {\textstyle\int _{0}^{1}}{x^{\alpha -1}}{(1-x)^{\beta -1}}\text{d}x$ is the normalizing constant; the integral (1), also known as the incomplete beta function [7], is intractable. Thus, there is strong demand for closed-form approximations of tail probabilities, for example, in the context of adaptive Bayesian inference [8], Bayesian nonparametric statistics [4], properties of random matrices [10, 26], and (obviously) large deviation theory [33].
The main goal of this paper is to give accurate closed-form bounds for the beta distribution in the form of sub-gamma type exponential concentration inequalities [3], that is,
(2)
\[ \mathbf{P}\{X-\mathbf{E}[X]<-\epsilon \},\mathbf{P}\{X-\mathbf{E}[X]>\epsilon \}\leqslant \exp \left(-\frac{{\epsilon ^{2}}}{2v+2c\epsilon }\right)\]
for any $\epsilon \geqslant 0$ and two parameters: the variance proxy v and the scale c, both depending on α and β. Such concentration bounds, pioneered by Bernstein [2] and popularized by Hoeffding [14], are capable of modelling both sub-Gaussian and sub-exponential behaviour. Due to this flexibility, bounds of this sort are the workhorse of approximation arguments used in modern statistics [18, 3, 22, 30].
1.2 Contribution
The contributions of this work are as follows:
• Optimal Bernstein-type concentration inequality. We establish an exponential concentration bound (2) with v matching the variance (which is optimal) and, respectively, the optimal value of c, which depends on the ratio of the third and second central moments. This bound is shown to be optimal in the regime of small deviations.
• Useful recursion for central moments of the beta distribution: it is of order 2 with coefficients depending linearly on the moment order. This formula addresses the lack of a simple closed-form expression for higher-order moments. We use it to estimate the moment generating function.
• Implementation and numerical evaluation. Symbolic algebra software has been used to validate the numerical comparison, and a numerical evaluation is provided to illustrate the new bounds. The code snippets are shared in the paper, and the evaluation is also shared as a Colab notebook [27].
1.3 Related work
When judging bounds in the form of (2), it is important to insist on the optimality of the sub-Gaussian behavior. For small deviations ϵ the bound (2) becomes approximately Gaussian with variance v, thus we ideally want $v=\mathbf{Var}[X]$; exponential bounds of this type are considered optimal in the literature [1]. On the other hand, bounds with $v>\mathbf{Var}[X]$ essentially overshoot the variance, leading to unnecessarily wide tails and increased uncertainty in statistical inference.
Bearing this in mind, we review prior suboptimal bounds in the form of (2):
As for the central moment recursion, we note that:
• Folklore methods give some crude bounds; for example, one can express a beta random variable in terms of gamma distributions and utilize their concentration properties. Such techniques do not give optimal exponents.
• Some bounds in the form of (2) can be derived from established inequalities on the incomplete beta function. Using the well-known inequality [6], which depends on a Kullback–Leibler term, unfortunately leads to suboptimal bounds in the regime of small deviations (as seen from the Taylor approximation, we necessarily obtain the variance factor $v=\frac{\alpha +\beta +1}{\alpha +\beta }\cdot \mathbf{Var}[X]$, hence overshooting the variance).
• The work [10] gives bounds with explicit but suboptimal v and c, only valid in the limited range of deviations $\sqrt{\frac{6\alpha }{{(\alpha +\beta )^{2}}}}<\epsilon <\frac{\alpha }{\alpha +\beta }$. The proof relies on specific integral estimates and cannot be sharpened much.
• The work [21] determines the best bound assuming $c=0$ (that is, of sub-Gaussian type). It is not given in closed form, but can be computed numerically as the solution of a transcendental equation. Although the bound is quite sharp in the symmetric case $\alpha =\beta $, it is much worse than our bound when the beta distribution is skewed (the typical case).
• The work [33] obtains suboptimal v and c, shown to be far from the true values by unknown constant factors. With these techniques, it is not possible to obtain the optimal exponent, which is the focus of this work.
Remark 1.
While the current paper was under review, Henzi and Duembgen [13] presented similar tail bounds, obtained by working with integral-derived inequalities. When framed as Bernstein’s inequality, their constant c is worse for the heavier tail (e.g., the right tail in case $\alpha <\beta $), but better on the lighter tail (e.g., $c<0$ for the left tail in case $\alpha <\beta $).
2 Results
2.1 Optimal Bernstein-type concentration bounds
The central result of this work is a Bernstein-type tail bound with numerically optimal parameters. The precise statement is given in the following theorem.
Theorem 1 (Bernstein’s Inequality).
Let $X\sim \mathsf{Beta}(\alpha ,\beta )$. Define the parameters
(3)
\[ \begin{aligned}{}v& \triangleq \frac{\alpha \beta }{{(\alpha +\beta )^{2}}(\alpha +\beta +1)},\\ {} c& \triangleq \frac{2\left(\beta -\alpha \right)}{\left(\alpha +\beta \right)\left(\alpha +\beta +2\right)}.\end{aligned}\]
Then the upper tail of X is bounded as
(4)
\[ \mathbf{P}\left\{X>\mathbf{E}[X]+\epsilon \right\}\leqslant \left\{\begin{array}{l@{\hskip10.0pt}l}\exp \left(-\frac{{\epsilon ^{2}}}{2\left(v+\frac{c\epsilon }{3}\right)}\right),\hspace{1em}& \beta \geqslant \alpha ,\\ {} \exp \left(-\frac{{\epsilon ^{2}}}{2v}\right),\hspace{1em}& \beta <\alpha ,\end{array}\right.\]
and the lower tail of X is bounded as
(5)
\[ \mathbf{P}\left\{X<\mathbf{E}[X]-\epsilon \right\}\leqslant \left\{\begin{array}{l@{\hskip10.0pt}l}\exp \left(-\frac{{\epsilon ^{2}}}{2\left(v+\frac{c\epsilon }{3}\right)}\right),\hspace{1em}& \alpha \geqslant \beta ,\\ {} \exp \left(-\frac{{\epsilon ^{2}}}{2v}\right),\hspace{1em}& \alpha <\beta .\end{array}\right.\]
Remark 2 (Variance and Scale Parameters).
The variance parameter equals $v=\mathbf{Var}[X]$, and the scale parameter equals $c=\frac{\mathbf{E}[{(X-\mathbf{E}[X])^{3}}]}{\mathbf{Var}[X]}$.
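As a quick illustration of Theorem 1 (a minimal sketch, not one of the paper's listings; it assumes SciPy's scipy.stats.beta for the exact tail), the upper-tail bound (4) can be compared with the exact probability:

import numpy as np
from scipy.stats import beta as beta_dist

def bernstein_upper_tail(a, b, eps):
    """Upper-tail bound (4) from Theorem 1 for X ~ Beta(a, b): P{X > E[X] + eps}."""
    v = a * b / ((a + b) ** 2 * (a + b + 1))        # variance parameter, v = Var[X]
    c = 2 * (b - a) / ((a + b) * (a + b + 2))       # scale parameter
    if b >= a:
        return np.exp(-eps ** 2 / (2 * (v + c * eps / 3)))
    return np.exp(-eps ** 2 / (2 * v))

a, b = 2.0, 8.0                                     # a skewed case with E[X] = 0.2
mean = a / (a + b)
for eps in (0.05, 0.1, 0.2):
    exact = beta_dist.sf(mean + eps, a, b)          # exact upper tail
    print(f"eps={eps:.2f}  exact={exact:.3e}  bound={bernstein_upper_tail(a, b, eps):.3e}")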
Remark 3 (Mixed Gaussian-Exponential Behaviour).
For small values of ϵ the bound behaves like a Gaussian with variance $v=\mathbf{Var}[X]$. For larger values of ϵ, the bound is close to an exponential tail with rate $\frac{3}{2c}$.
The result below shows that the parameters c and v are best possible, in the regime of small deviations ϵ, for the exponential moment method.
Theorem 2 (Best Cramér–Chernoff Bound).
The best bound that can be obtained from the Cramér–Chernoff method is
(6)
\[ \mathbf{P}\left\{X>\mathbf{E}[X]+\epsilon \right\}\leqslant {\mathrm{e}^{-\frac{{\epsilon ^{2}}}{2v}+\frac{c{\epsilon ^{3}}}{6{v^{2}}}+O\left({\epsilon ^{4}}\right)}},\]
as $\epsilon \to 0$, with constants c, v as in Theorem 1.
2.2 Handy recurrence for central moments
Our optimal Bernstein inequality is proved using the novel recursion for central moments, presented below in Theorem 3 as a contribution of independent interest. Being of order 2, it is not only efficient to evaluate numerically, but also easy to manipulate algebraically when working with closed-form formulas.
Theorem 3 (Order 2 Recurrence for Central Moments).
For $X\sim \mathsf{Beta}(\alpha ,\beta )$ and any integer order $d\geqslant 2$, the following recurrence relation holds:
(7)
\[ \begin{aligned}{}\mathbf{E}[{(X-\mathbf{E}[X])^{d}}]& =\frac{(d-1)(\beta -\alpha )}{(\alpha +\beta )(\alpha +\beta +d-1)}\cdot \mathbf{E}[{(X-\mathbf{E}[X])^{d-1}}]\\ {} & \hspace{1em}+\frac{(d-1)\alpha \beta }{{(\alpha +\beta )^{2}}(\alpha +\beta +d-1)}\cdot \mathbf{E}[{(X-\mathbf{E}[X])^{d-2}}].\end{aligned}\]
Corollary 2 (P-recursive Property).
The recursion given in Theorem 3, after rearrangements, is of order 2 with coefficients linear in the index d.
The implementation of the recursion, performed in Python’s symbolic algebra package Sympy [23], is presented in Listing 1. Scalability is ensured by memoization, which caches intermediate results to avoid repeated calls.
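For illustration, a minimal sketch of such a memoized SymPy implementation is given below; the actual Listing 1 may differ in details such as caching and simplification strategy.

from functools import lru_cache
import sympy as sp

alpha, beta = sp.symbols('alpha beta', positive=True)

@lru_cache(maxsize=None)            # memoization: each order is computed only once
def central_moment(d):
    """E[(X - E[X])^d] for X ~ Beta(alpha, beta), via the recurrence of Theorem 3."""
    if d == 0:
        return sp.Integer(1)
    if d == 1:
        return sp.Integer(0)
    return sp.simplify(
        (d - 1) * (beta - alpha) / ((alpha + beta) * (alpha + beta + d - 1)) * central_moment(d - 1)
        + (d - 1) * alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + d - 1)) * central_moment(d - 2)
    )

print(central_moment(2))            # alpha*beta/((alpha + beta)**2*(alpha + beta + 1))
print(central_moment(4))            # fourth central moment, cf. Table 1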
To demonstrate the algorithm in action, we list some first central moments of the beta distribution in Table 1.
Central Moment | Explicit Formula |
$\mathbf{E}[X]$ | $\frac{\alpha }{\alpha +\beta }$ |
$\mathbf{Var}[X]$ | $\frac{\alpha \beta }{{(\alpha +\beta )^{2}}(\alpha +\beta +1)}$ |
$\textbf{Skew}[X]$ | $\frac{2\left(\beta -\alpha \right)\sqrt{\alpha +\beta +1}}{\sqrt{\alpha \beta }\left(\alpha +\beta +2\right)}$ |
$\textbf{Kurt}[X]$ | $\frac{3\left(\alpha \beta \left(\alpha +\beta +2\right)+2{\left(\alpha -\beta \right)^{2}}\right)\left(\alpha +\beta +1\right)}{\alpha \beta \left(\alpha +\beta +2\right)\left(\alpha +\beta +3\right)}$ |
$\mathbf{E}[{(X-\mathbf{E}[X])^{5}}]/{\sqrt{\mathbf{Var}[X]}^{5}}$ | $\frac{4\left(\beta -\alpha \right){\left(\alpha +\beta +1\right)^{\frac{3}{2}}}\left(3\alpha \beta \left(\alpha +\beta +2\right)+2\alpha \beta \left(\alpha +\beta +3\right)+6{\left(\alpha -\beta \right)^{2}}\right)}{{\alpha ^{\frac{3}{2}}}{\beta ^{\frac{3}{2}}}\left(\alpha +\beta +2\right)\left(\alpha +\beta +3\right)\left(\alpha +\beta +4\right)}$ |
Note that the algorithm addresses the lack of simple derivations for such formulas, as well as the lack of formulas beyond skewness and kurtosis.
Finally, we note that the recurrence implies, by induction, the following important property: when $\alpha \leqslant \beta $ all central moments of X are nonnegative, while for $\alpha >\beta $ the central moments of even order are nonnegative and those of odd order are nonpositive.
2.3 Numerical evaluation
The experiment summarized in Figure 1 (the code shared in [27]) demonstrates the advantage of sub-gamma bounds obtained in Theorem 1 over the optimal sub-Gaussian bounds from prior work [21].
For skewed beta distributions, the bound from this work is more accurate than the sub-Gaussian approximation. The chosen range of parameters covers the cases where the expectation $\mathbf{E}[X]$ is a relatively small number like 0.02 or 0.2, a typical range for many practical Beta models, particularly A/B testing. Note that there is still room for improvement in the regime of larger deviations ϵ, where the bounds could potentially benefit from refining the numeric value of c. The experiment details are shared in the Colab Notebook [27].
3 Preliminaries
3.1 Gaussian hypergeometric function
The Gaussian hypergeometric function is defined as follows [11]:
(8)
\[ {_{2}}{F_{1}}(a,b;c;z)={\sum \limits_{k=0}^{+\infty }}\frac{{(a)_{k}}{(b)_{k}}}{{(c)_{k}}}\cdot \frac{{z^{k}}}{k!},\]
where we use the Pochhammer symbol defined as
(9)
\[ {(x)_{k}}=\left\{\begin{array}{l@{\hskip10.0pt}l}1,\hspace{1em}& k=0,\\ {} x(x+1)\cdots (x+k-1),\hspace{1em}& k>0.\end{array}\right.\]
We call two functions F of this form contiguous when their parameters differ by integers. Gauss considered ${_{2}}{F_{1}}({a^{\prime }},{b^{\prime }};{c^{\prime }};z)$ where ${a^{\prime }}=a\pm 1$, ${b^{\prime }}=b\pm 1$, ${c^{\prime }}=c\pm 1$ and proved that between F and any two of these functions there exists a linear relationship with coefficients linear in z. It follows [29, 15] that F and any two of its contiguous series are linearly dependent, with the coefficients being rational in the parameters and z. For our purpose, we need to express F by the series with increased second argument. The explicit formula comes from [32]:
Lemma 1 (Contiguous Relation).
It holds that
(10)
\[ {_{2}}{F_{1}}(a,b;c;z)=\frac{2b-c+2+(a-b-1)z}{b-c+1}{\hspace{0.1667em}_{2}}{F_{1}}(a,b+1;c;z)+\frac{(b+1)(z-1)}{b-c+1}{\hspace{0.1667em}_{2}}{F_{1}}(a,b+2;c;z).\]
3.2 Beta distribution properties
We use the machinery of hypergeometric functions to establish certain properties of the beta distribution. The first result expresses the central moment in terms of the Gaussian hypergeometric function.
Lemma 2 (Central Beta Moments).
Let $X\sim \mathsf{Beta}(\alpha ,\beta )$, then we have that
(11)
\[ \mathbf{E}[{(X-\mathbf{E}[X])^{d}}]={\left(-\frac{\alpha }{\alpha +\beta }\right)^{d}}{\hspace{2.5pt}_{2}}{F_{1}}\left(\alpha ,-d;\alpha +\beta ;\frac{\alpha +\beta }{\alpha }\right),\]
where ${_{2}}{F_{1}}$ is the Gaussian hypergeometric function. The proof appears in Section 4.1.
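A numerical sanity check of (11) can be done with mpmath (an illustrative sketch, not part of the paper's listings); the hypergeometric series terminates because the second argument is a negative integer:

import mpmath as mp

def central_moment_hyp(a, b, d):
    """E[(X - E[X])^d] via the hypergeometric representation (11)."""
    m = a / (a + b)
    return (-m) ** d * mp.hyp2f1(a, -d, a + b, (a + b) / a)

def central_moment_quad(a, b, d):
    """Reference value by direct numerical integration of the density in (1)."""
    m = a / (a + b)
    dens = lambda x: x ** (a - 1) * (1 - x) ** (b - 1) / mp.beta(a, b)
    return mp.quad(lambda x: (x - m) ** d * dens(x), [0, 1])

a, b = 2.5, 7.0
for d in range(2, 6):
    print(d, central_moment_hyp(a, b, d), central_moment_quad(a, b, d))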
3.3 Cramér–Chernoff method
Below, we review the canonical method of obtaining concentration inequalities from the moment generating function, following the discussion in [3].
Recall that for a given random variable Z we define its moment generating function (MGF) and, respectively, its cumulant generating function as
(12)
\[ \begin{aligned}{}{\phi _{Z}}(t)& \triangleq \mathbf{E}\left[\exp (tZ)\right],\\ {} {\psi _{Z}}(t)& \triangleq \log \mathbf{E}\left[\exp (tZ)\right].\end{aligned}\]
In this notation, Markov’s exponential inequality becomes
(13)
\[ \mathbf{P}\left\{Z\geqslant \epsilon \right\}\leqslant \exp (-t\epsilon )\mathbf{E}\left[\exp (tZ)\right]=\exp (-t\epsilon +{\psi _{Z}}(t)),\]
valid for any ϵ and any nonnegative t, and nontrivial when ${\phi _{Z}}(t)$ is finite. The optimal value of t is determined by the so-called Cramér transform
(14)
\[ {\psi _{Z}^{\ast }}(\epsilon )\triangleq \sup \left\{t\epsilon -{\psi _{Z}}(t):t\geqslant 0\right\},\]
and leads to the Chernoff inequality
(15)
\[ \mathbf{P}\left\{Z\geqslant \epsilon \right\}\leqslant \exp (-{\psi _{Z}^{\ast }}(\epsilon )).\]
3.4 Logarithmic inequalities
It is well known that $\log (1+x)=x-\frac{{x^{2}}}{2}+\frac{{x^{3}}}{3}-\cdots \hspace{0.1667em}$ for $|x|<1$. Below, we present a useful refinement of this expansion:
Lemma 3 (Logarithmic Inequality).
For any $x\geqslant 0$ it holds that
(16)
\[ \log (1+x)\leqslant x-\frac{{x^{2}}}{2\left(1+\frac{2x}{3}\right)}.\]
The bound is illustrated in Figure 2 below, and we see that it matches up to the term $O({x^{4}})$. The proof appears in Section 4.3.
Fig. 2.
The logarithmic inequality from Lemma 3
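A grid-based numerical check of the inequality of Lemma 3 (illustrative only, not a proof) takes a few lines:

import numpy as np

x = np.linspace(0.0, 50.0, 200001)
gap = np.log1p(x) - (x - x ** 2 / (2 * (1 + 2 * x / 3)))
print(gap.max())     # 0.0, attained at x = 0; negative elsewhere, consistent with Lemma 3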
4 Proofs
4.1 Proof of Lemma 2
We know that the raw higher-order moments of X are given by [16]
(17)
\[ \mathbf{E}[{X^{k}}]=\frac{{(\alpha )_{k}}}{{(\alpha +\beta )_{k}}},\hspace{1em}k=0,1,2,\ldots \]
Combining this with the binomial theorem, we obtain
(18)
\[ \begin{aligned}{}\mathbf{E}[{(X-\mathbf{E}[X])^{d}}]& ={\sum \limits_{k=0}^{d}}{(-1)^{d-k}}\left(\genfrac{}{}{0.0pt}{}{d}{k}\right)\mathbf{E}[{X^{k}}]{\left(\frac{\alpha }{\alpha +\beta }\right)^{d-k}}\\ {} & ={\sum \limits_{k=0}^{d}}{(-1)^{d-k}}\left(\genfrac{}{}{0.0pt}{}{d}{k}\right)\frac{{(\alpha )_{k}}}{{(\alpha +\beta )_{k}}}{\left(\frac{\alpha }{\alpha +\beta }\right)^{d-k}}.\end{aligned}\]
Finally, by $\left(\genfrac{}{}{0.0pt}{}{d}{k}\right)={(-1)^{k}}\frac{{(-d)_{k}}}{k!}$ and the definition of the hypergeometric function,
(19)
\[ \begin{aligned}{}\mathbf{E}[{(X-\mathbf{E}[X])^{d}}]& ={(-1)^{d}}{\sum \limits_{k=0}^{d}}\frac{{(-d)_{k}}{(\alpha )_{k}}}{k!{(\alpha +\beta )_{k}}}{\left(\frac{\alpha }{\alpha +\beta }\right)^{d-k}}\\ {} & ={\left(-\frac{\alpha }{\alpha +\beta }\right)^{d}}{\sum \limits_{k=0}^{d}}\frac{{(-d)_{k}}{(\alpha )_{k}}}{k!{(\alpha +\beta )_{k}}}{\left(\frac{\alpha +\beta }{\alpha }\right)^{k}}\\ {} & ={\left(-\frac{\alpha }{\alpha +\beta }\right)^{d}}{\hspace{2.5pt}_{2}}{F_{1}}\left(\alpha ,-d;\alpha +\beta ;\frac{\alpha +\beta }{\alpha }\right),\end{aligned}\]
which finishes the proof.
4.2 Proof of Theorem 3
The goal is to prove the recursion formula (7). One way to accomplish this is to reuse the summation formulas developed in the proof of Lemma 2. Below, we give an argument that uses the properties of hypergeometric functions to better highlight their connection to the beta distribution.
To simplify the notation, we define $a=\alpha $, $b=-d$, $c=\alpha +\beta $, and $z=\frac{\alpha +\beta }{\alpha }$. Define ${\mu _{d}}\triangleq \mathbf{E}[{(X-\mathbf{E}[X])^{d}}]$. Then using Lemma 2 and Lemma 1 we obtain
(20)
\[ \begin{aligned}{}{(-z)^{d}}\cdot {\mu _{d}}& ={\hspace{0.1667em}_{2}}{F_{1}}(a,b;c;z)\\ {} & =\frac{2b-c+2+(a-b-1)z}{b-c+1}{\hspace{0.1667em}_{2}}{F_{1}}(a,b+1;c;z)\\ {} & \hspace{1em}+\frac{(b+1)(z-1)}{b-c+1}{\hspace{0.1667em}_{2}}{F_{1}}(a,b+2;c;z).\end{aligned}\]
In terms of α, β, d we obtain
(21)
\[ \begin{aligned}{}\frac{2b-c+2+(a-b-1)z}{b-c+1}& =(d-1)\cdot \frac{\alpha -\beta }{\alpha (\alpha +\beta +d-1)},\\ {} \frac{(b+1)(z-1)}{b-c+1}& =(d-1)\cdot \frac{\beta }{\alpha (\alpha +\beta +d-1)}.\end{aligned}\]
The computations are done in the SymPy package [23], as shown in Listing 2. Since we have
(22)
\[ \begin{aligned}{}{_{2}}{F_{1}}(a,b+1;c;z)& ={\hspace{2.5pt}_{2}}{F_{1}}(a,-d+1;c;z)={\mu _{d-1}}\cdot {(-z)^{d-1}},\\ {} {_{2}}{F_{1}}(a,b+2;c;z)& ={\hspace{2.5pt}_{2}}{F_{1}}(a,-d+2;c;z)={\mu _{d-2}}\cdot {(-z)^{d-2}},\end{aligned}\]
it follows that
(23)
\[ {z^{2}}{\mu _{d}}=-\frac{z(d-1)(\alpha -\beta )}{\alpha (\alpha +\beta +d-1)}\cdot {\mu _{d-1}}+\frac{(d-1)\beta }{\alpha (\alpha +\beta +d-1)}\cdot {\mu _{d-2}}.\]
Recalling that $z=\frac{\alpha +\beta }{\alpha }$ we finally obtain the claimed recursion (7), which finishes the proof.
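A sketch of the kind of symbolic verification performed by Listing 2 (the actual listing may differ) is to check the simplifications (21) directly:

import sympy as sp

alpha, beta, d = sp.symbols('alpha beta d', positive=True)
a, b, c, z = alpha, -d, alpha + beta, (alpha + beta) / alpha

# Coefficients of the contiguous relation in (20), compared with their simplified forms in (21)
coef1 = (2*b - c + 2 + (a - b - 1)*z) / (b - c + 1)
coef2 = (b + 1) * (z - 1) / (b - c + 1)

print(sp.simplify(coef1 - (d - 1) * (alpha - beta) / (alpha * (alpha + beta + d - 1))))   # 0
print(sp.simplify(coef2 - (d - 1) * beta / (alpha * (alpha + beta + d - 1))))             # 0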
4.3 Proof of Lemma 3
Consider the function $f(x)=\log (1+x)-\Big(x-\frac{{x^{2}}}{2(1+\frac{2x}{3})}\Big)$ for nonnegative x. Using Sympy we find that
\[ {f^{\prime }}(x)=-\frac{{x^{3}}}{(1+x){(3+2x)^{2}}}\leqslant 0,\hspace{1em}x\geqslant 0,\]
as demonstrated in Listing 3. Thus, f is nonincreasing; since $f(0)=0$ we obtain $f(x)\leqslant 0$ as claimed.
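A sketch of the corresponding SymPy computation (the actual Listing 3 may differ) is:

import sympy as sp

x = sp.symbols('x', nonnegative=True)
f = sp.log(1 + x) - (x - x**2 / (2 * (1 + 2*x/3)))

# prints a form equivalent to -x**3/((1 + x)*(3 + 2*x)**2), hence f'(x) <= 0 for x >= 0
print(sp.factor(sp.simplify(sp.diff(f, x))))
print(f.subs(x, 0))     # 0, so f(x) <= 0 for all x >= 0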
4.4 Proof of Theorem 1
If $X\sim \mathsf{Beta}(\alpha ,\beta )$, then for ${X^{\prime }}=1-X$ we obtain $X-\mathbf{E}[X]=\mathbf{E}[{X^{\prime }}]-{X^{\prime }}$ and thus the upper tail of X equals the lower tail of ${X^{\prime }}$, and the lower tail of X equals the upper tail of ${X^{\prime }}$. Furthermore, from (1) it follows that ${X^{\prime }}\sim \mathsf{Beta}(\beta ,\alpha )$. Thus, by exchanging the roles of α and β accordingly, it suffices to prove the theorem for the upper tail.
To facilitate calculations, we introduce the centred version of X, namely $Z\triangleq X-\mathbf{E}[X]$. With the help of the normalized moments ${m_{d}}\triangleq \mathbf{E}[{Z^{d}}]/d!$, we expand the moment generating function of Z into the series
(28)
\[ \phi (t)\triangleq \mathbf{E}\exp (tZ)=1+{\sum \limits_{d=2}^{+\infty }}{m_{d}}{t^{d}},\hspace{1em}t\in \mathbb{R}\]
(which converges everywhere as $|Z|\leqslant 1$), and expand its derivative as
(29)
\[ {\phi ^{\prime }}(t)={\sum \limits_{d=2}^{+\infty }}d{m_{d}}{t^{d-1}},\hspace{1em}t\in \mathbb{R}\]
(everywhere, accordingly).
The proof strategy is as follows: we start by expressing the cumulant generating function as an integral involving the moment generating function and its derivative; then we study series expansions of the functions involved, using the moment recursion; this leads to a tractable upper bound on the integral; finally, the cumulant bound is optimized by the Cramér–Chernoff method. To achieve the promised optimal constants, it is critical to keep all intermediate estimates sharp up to the order of 4. The proof flow is illustrated in Figure 3.
Observe that the cumulant generating function ψ can be expressed in terms of the moment generating function ϕ as follows.
Proof.
This follows by the logarithmic derivative identity ${(\log \phi )^{\prime }}=\frac{{\phi ^{\prime }}}{\phi }$. □
We now present a recursion for coefficients of the series expansions.
The relation between ${\phi ^{\prime }}$ and ϕ is given in the following auxiliary result.
We are now ready to estimate the cumulant generating function.
Claim 3.
For nonnegative t and $c=\frac{\beta -\alpha }{(\alpha +\beta )(\alpha +\beta +2)}$ it holds that
(33)
\[ \begin{array}{r@{\hskip0pt}l@{\hskip10pt}r}\displaystyle \frac{{\phi ^{\prime }}(t)}{\phi (t)}\leqslant \frac{vt}{1-ct},& & \displaystyle \alpha \leqslant \beta \hspace{2.5pt}\textit{and}\hspace{2.5pt}t<\frac{1}{c},\\ {} \displaystyle \phi (t)\leqslant \exp \left(\frac{v{t^{2}}}{2}\right),& & \displaystyle \alpha >\beta .\end{array}\]
Proof of Claim.
The proof splits depending on the relation between α and β. The case $\alpha >\beta $ is a bit easier and leads to sub-Gaussian bounds.
Case $\alpha >\beta $: From Corollary 3 and (31) it follows that ${m_{d}}$ is nonnegative when d is even, and negative otherwise. Thus, for even d we have
(34)
\[ {m_{d}}\leqslant \frac{1}{d}\cdot \frac{\alpha \beta }{\alpha +\beta +d-1}{m_{d-2}}\leqslant \frac{v}{d}\cdot {m_{d-2}}.\]
Repeating this $d/2$ times and combining with ${m_{d}}\leqslant 0$ for odd d, we obtain
(35)
\[ {m_{d}}\leqslant \left\{\begin{array}{l@{\hskip10.0pt}l}\frac{{v^{\frac{d}{2}}}}{d!!},\hspace{1em}& d\hspace{2.5pt}\text{even},\\ {} 0,\hspace{1em}& d\hspace{2.5pt}\text{odd}.\end{array}\right.\]
Using $d!!={2^{d/2}}(d/2)!$ for even d, for $t\geqslant 0$ we obtain
(36)
\[ \phi (t)\leqslant 1+{\sum \limits_{k=1}^{+\infty }}\frac{{\left(\frac{v{t^{2}}}{2}\right)^{k}}}{k!}=\exp \left(\frac{v{t^{2}}}{2}\right),\]
as required.
Case $\alpha \leqslant \beta $: By (28) and (29), for any t and any c (to be determined later)
(37)
\[ (1-ct){\phi ^{\prime }}(t)-vt\phi (t)={\sum \limits_{d=3}^{+\infty }}\left(d{m_{d}}-(d-1)c{m_{d-1}}-v{m_{d-2}}\right){t^{d-1}},\]
where we use the fact that the expansion terms with 1, t, ${t^{2}}$ vanish. Using (31) to eliminate the term with ${m_{d-2}}$, and expressing v in terms of α, β, we obtain
(38)
\[\begin{aligned}{}& d{m_{d}}-(d-1)c{m_{d-1}}-v{m_{d-2}}\\ {} & \hspace{1em}=-\frac{d\left(d-2\right)}{\alpha +\beta +1}{m_{d}}+\frac{\left(1-d\right)\left(\alpha -\beta +c\left({\alpha ^{2}}+2\alpha \beta +\alpha +{\beta ^{2}}+\beta \right)\right)}{{\alpha ^{2}}+2\alpha \beta +\alpha +{\beta ^{2}}+\beta }{m_{d-1}}.\end{aligned}\]
Since we assume $t\geqslant 0$, it suffices to show that (38) is nonpositive for $d\geqslant 3$, for c defined as in Theorem 1.
Since $\alpha \leqslant \beta $, we have that ${m_{d}}\geqslant 0$ for all d by induction in (31). In particular, ${m_{d-2}}\geqslant 0$, and from (31) we obtain the bound
(39)
\[ {m_{d}}\geqslant \frac{\left(\beta -\alpha \right)\left(d-1\right){m_{d-1}}}{d\left(\alpha +\beta \right)\left(\alpha +\beta +d-1\right)}\]
(note that the bound is sharp for $d=3$), which together with (38) yields
(40)
\[\begin{aligned}{}& d{m_{d}}-(d-1)c{m_{d-1}}-v{m_{d-2}}\\ {} & \hspace{1em}\leqslant c\left(1-d\right){m_{d-1}}+\frac{\left(-\alpha d+\alpha +\beta d-\beta \right){m_{d-1}}}{{\alpha ^{2}}+2\alpha \beta +\alpha d-\alpha +{\beta ^{2}}+\beta d-\beta }.\end{aligned}\]
The last expression is linear in c, with negative coefficient (because ${m_{d-1}}\geqslant 0$ and $d\geqslant 3$). Thus, it is nonpositive if and only if the condition
(41)
\[ c\geqslant \frac{\beta -\alpha }{\left(\alpha +\beta \right)\left(\alpha +\beta +d-1\right)}\]
holds for all $d\geqslant 3$. Since $\beta -\alpha \geqslant 0$, this is equivalent to
(42)
\[ c\geqslant \frac{\beta -\alpha }{\left(\alpha +\beta \right)\left(\alpha +\beta +2\right)},\]
and under this condition, (38) is nonpositive. Replacing in this reasoning c with $\frac{c}{2}$, to align with Theorem 1, finishes the proof. The calculations are done in Sympy and shown in Listing 4. □
Proof.
This follows from the previous claim, by integrating the first branch of the bound and, respectively, by taking the logarithm of the second branch. □
Remark 5.
Note that this bound, in case $\alpha \leqslant \beta $, gives $\psi (t)\leqslant \frac{{t^{2}}v}{2}+\frac{(\beta -\alpha ){t^{3}}v}{3(\alpha +\beta )(\alpha +\beta +2)}+O\left({t^{4}}\right)$, and comparing this with the actual expansion $\psi (t)=\frac{{t^{2}}\mathbf{Var}[X]}{2}+\frac{{t^{3}}\mathbf{E}[{(X-\mathbf{E}[X])^{3}}]}{6}+O({t^{4}})$ we find that the bound is sharp up to $O({t^{4}})$.
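This sharpness check can be reproduced symbolically. The sketch below assumes, as indicated in the proof of Claim 4, that the bound is the integral of the first branch of (33), with c as in Claim 3:

import sympy as sp

t, alpha, beta = sp.symbols('t alpha beta', positive=True)
v = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))            # Var[X]
c = (beta - alpha) / ((alpha + beta) * (alpha + beta + 2))               # the c of Claim 3

# Assumed form of Claim 4's bound for alpha <= beta: the integral of v*s/(1 - c*s) over [0, t]
psi_bound = (v / c ** 2) * (-c * t - sp.log(1 - c * t))

# Third central moment from the recurrence (7): mu_3 = 2*c*v
mu3 = 2 * c * v

# The bound and the true cumulant expansion agree up to O(t^4)
diff = sp.series(psi_bound, t, 0, 4).removeO() - (v * t ** 2 / 2 + mu3 * t ** 3 / 6)
print(sp.simplify(diff))   # 0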
Finally, in the next claim, we present the Chernoff inequality.
Proof.
With the help of a computer algebra package, we plug the upper bound on $\psi (t)$ from Claim 4 into the Cramér–Chernoff bound, and find that it is maximized at ${t_{\mathit{best}}}=\frac{\epsilon }{c\epsilon +v}$ when $0 < t < 1/c$. This yields the bound
(45)
\[ \begin{aligned}{}{\psi ^{\ast }}(\epsilon )& \triangleq \sup \{t\epsilon -\psi (t):0\leqslant t\}\\ {} & \geqslant \epsilon {t_{\mathit{best}}}-\psi ({t_{\mathit{best}}})\\ {} & =\frac{v}{{c^{2}}}\cdot \left(\frac{c\epsilon }{v}-\log \left(\frac{c\epsilon }{v}+1\right)\right).\end{aligned}\]
The computations are done in Sympy, as shown in Listing 5. Using (16) with $x=\frac{c\epsilon }{v}$ we finally obtain ${\psi ^{\ast }}(\epsilon )\geqslant \frac{{\epsilon ^{2}}}{2\left(v+\frac{2c\epsilon }{3}\right)}$, which finishes the proof. □
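A sketch of the optimization behind Listing 5 (under the same assumption on the form of Claim 4's bound; the actual listing may differ) is:

import sympy as sp

t, eps, v, c = sp.symbols('t epsilon v c', positive=True)

# Assumed form of Claim 4's bound (integral of v*s/(1 - c*s) over [0, t], valid for 0 <= t < 1/c)
psi_bound = (v / c ** 2) * (-c * t - sp.log(1 - c * t))
print(sp.simplify(sp.diff(psi_bound, t) - v * t / (1 - c * t)))   # 0: correct antiderivative

# Stationary point of the Cramer-Chernoff objective t*eps - psi_bound
t_best = sp.solve(sp.diff(eps * t - psi_bound, t), t)
print(t_best)                                                      # [epsilon/(c*epsilon + v)]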
4.5 Proof of Theorem 2
Fix $\epsilon \geqslant 0$. Denote $Z=X-\mathbf{E}[X]$, and let $\phi (t)=\mathbf{E}[\exp (tZ)]$, $\psi (t)=\log \phi (t)$ be, respectively, the moment and cumulant generating functions of Z. By Jensen’s inequality $\phi (t)\geqslant \exp (t\mathbf{E}[Z])=1$, and so $\psi (t)\geqslant 0$ and $t\epsilon -\psi (t)\leqslant 0$ for $t\leqslant 0$. Since $\psi (0)=0$ we conclude that
(47)
\[ {\psi ^{\ast }}(\epsilon )=\sup \left\{t\epsilon -\psi (t):t\geqslant 0\right\}=\sup \left\{t\epsilon -\psi (t):t\in \mathbb{R}\right\},\]
which shows that the Cramér–Chernoff exponent equals the global maximum of the function $t\mapsto t\epsilon -\psi (t)$ (the Legendre–Fenchel transform), achieved for $t\geqslant 0$.
It is known that the cumulant generating function is convex, and strictly convex for nonconstant random variables [3]. Thus, $\psi (t)$ is strictly convex. As a consequence, the function $t\mapsto t\epsilon -\psi (t)$ is strictly concave. Thus, (47) is maximized at the value t which is the unique solution to
(48)
\[ {\psi ^{\prime }}(t)=\epsilon .\]
The last equation defines $t=t(\epsilon )$ implicitly and does not seem to be solvable with elementary functions. Nevertheless, the Lagrange inversion theorem ensures that t is analytic as a function of ϵ and gives the initial terms of the series expansion. More precisely, if $y=f(x)$ where f is analytic at 0 and $f(0)=0$, ${f^{\prime }}(0)\ne 0$, then x expands into a power series in y around $y=0$, with the coefficient of the term ${y^{k}}$ equal to $\frac{1}{k}$ times the coefficient of ${x^{k-1}}$ in the series expansion of ${\left(x/f(x)\right)^{k}}$ when $k>0$, and 0 when $k=0$; for a reference, see, for example, [9]. Applying this to $y=\epsilon $, $x=t$ and $f={\psi ^{\prime }}$ we obtain
(49)
\[ t=\frac{\epsilon }{v}-\frac{c{\epsilon ^{2}}}{2{v^{2}}}+O({\epsilon ^{3}}),\]
where we do computations in Sympy as shown in Listing 6. Note that we use only up to three terms of the expansion of f, since we are interested in the first two terms for x.
Therefore, we obtain the following exponent in the Chernoff inequality:
(50)
\[ {\psi ^{\ast }}(\epsilon )=\epsilon \cdot t(\epsilon )-\psi (t(\epsilon ))=\frac{{\epsilon ^{2}}}{2v}-\frac{c{\epsilon ^{3}}}{6{v^{2}}}+O({\epsilon ^{4}}),\]
which finishes the proof of the formula in Theorem 2. This expansion matches the expansion of the exponent in Theorem 1 up to $O({\epsilon ^{4}})$, which proves the optimality of the numeric constants.
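A sketch of the expansion that Listing 6 might perform (the actual listing may differ); only the coefficients $\mathbf{Var}[X]=v$ and $\mathbf{E}[{(X-\mathbf{E}[X])^{3}}]=cv$ from Remark 2 are used:

import sympy as sp

eps, v, c, t = sp.symbols('epsilon v c t', positive=True)
a1, a2 = sp.symbols('a1 a2')

# psi(t) up to O(t^4), using Var[X] = v and E[(X - E[X])^3] = c*v (Remark 2)
psi = v * t ** 2 / 2 + c * v * t ** 3 / 6

# Ansatz t(eps) = a1*eps + a2*eps^2 and the stationarity condition psi'(t(eps)) = eps
t_eps = a1 * eps + a2 * eps ** 2
residual = sp.expand(sp.diff(psi, t).subs(t, t_eps) - eps)
sol = sp.solve([residual.coeff(eps, 1), residual.coeff(eps, 2)], [a1, a2], dict=True)[0]
print(sol)                                            # {a1: 1/v, a2: -c/(2*v**2)}

# Chernoff exponent t*eps - psi(t) along t = t(eps), expanded to order eps^3
exponent = sp.expand(eps * t_eps - psi.subs(t, t_eps)).subs(sol)
print(sp.series(exponent, eps, 0, 4))                 # eps**2/(2*v) - c*eps**3/(6*v**2) + O(eps**4)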