Modern Stochastics: Theory and Applications


A quantitative functional central limit theorem for shallow neural networks
Volume 11, Issue 1 (2024), pp. 85–108
Valentina Cammarota, Domenico Marinucci, Michele Salvi, Stefano Vigogna

https://doi.org/10.15559/23-VMSTA238
Pub. online: 28 November 2023      Type: Research Article      Open Access

Received
5 July 2023
Revised
23 October 2023
Accepted
15 November 2023
Published
28 November 2023

Abstract

We prove a quantitative functional central limit theorem for one-hidden-layer neural networks with generic activation function. Our rates of convergence depend heavily on the smoothness of the activation function, and they range from logarithmic for nondifferentiable nonlinearities such as the ReLu to $1/\sqrt{n}$ for highly regular activations. Our main tools are based on functional versions of the Stein–Malliavin method; in particular, we rely on a quantitative functional central limit theorem which has been recently established by Bourguin and Campese [Electron. J. Probab. 25 (2020), 150].

1 Introduction and background

In this paper we shall be concerned with one-hidden-layer neural networks with Gaussian random weights, that is, random fields $F:{\mathbb{S}^{d-1}}\to \mathbb{R}$ of the form
(1)
\[ F(x)=\frac{1}{\sqrt{n}}{\sum \limits_{j=1}^{n}}{V_{j}}\sigma \Bigg({\sum \limits_{\ell =1}^{d}}{W_{j\ell }}{x_{\ell }}\Bigg)=\frac{1}{\sqrt{n}}{\sum \limits_{j=1}^{n}}{V_{j}}\sigma ({W_{j}}x),\]
where ${V_{j}}\in \mathbb{R}$, ${W_{j}}\in {\mathbb{R}^{1\times d}}$ are, respectively, random variables and vectors, whose entries are independent Gaussian with zero mean and variance $\mathbb{E}[{V_{j}^{2}}]=\mathbb{E}[{W_{j\ell }^{2}}]=1$, $j=1,\dots ,n$, $\ell =1,\dots ,d$. Here $\sigma :\mathbb{R}\to \mathbb{R}$ is an activation function whose properties and form we will discuss below, the $\sigma ({W_{j}}x)$ represent the artificial neurons, and n is their number, namely the width of the network. The random field F is defined on the unit sphere ${\mathbb{S}^{d-1}}$, with zero mean and covariance function
(2)
\[ S({x_{1}},{x_{2}}):=\mathbb{E}\big[F({x_{1}})F({x_{2}})\big]=\mathbb{E}\big[\sigma ({W_{j}}{x_{1}})\sigma ({W_{j}}{x_{2}})\big],\hspace{1em}{x_{1}},{x_{2}}\in {\mathbb{S}^{d-1}}.\]
The covariance function S also defines a zero mean Gaussian random field $Z:{\mathbb{S}^{d-1}}\to \mathbb{R}$, which gives the asymptotic distribution of F for $n\to \infty $ [23].
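As an illustration of the model (1), the following minimal Python sketch draws one realization of the random field F for a user-chosen activation. The names `sample_network` and `relu`, and the use of the standard library random generator, are assumptions made for this example and do not come from the paper.

```python
import math
import random


def relu(t):
    """ReLu activation, sigma(t) = t for t >= 0 and 0 otherwise."""
    return max(t, 0.0)


def sample_network(n, d, sigma=relu, rng=None):
    """Draw the weights of (1) once and return the realized field x -> F(x).

    V_j and the entries W_{j,l} are independent standard Gaussians;
    n is the width and d the input dimension.
    """
    rng = rng or random.Random()
    V = [rng.gauss(0.0, 1.0) for _ in range(n)]
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

    def F(x):
        total = 0.0
        for v, w in zip(V, W):
            pre_activation = sum(wl * xl for wl, xl in zip(w, x))
            total += v * sigma(pre_activation)
        return total / math.sqrt(n)

    return F
```

Calling `sample_network(1000, 3)` and evaluating the result at a unit vector such as `[1.0, 0.0, 0.0]` gives one draw of $F(x)$; repeating the draw many times approximates the law whose Gaussian limit is Z.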
Our aim is to establish a quantitative functional central limit theorem for the network F as the number of neurons n increases, that is, to study the distance, under a suitable functional probability metric ${d_{2}}$, between F and Z as a function of n. In particular, we shall obtain bounds of the form
\[ {d_{2}}(F,Z)\le b(n,\alpha ),\]
where ${\lim \nolimits_{n\to \infty }}b(n,\alpha )=0$ and α is a parameter capturing the smoothness of the activation σ. For the sake of brevity and simplicity, in this paper we restrict our attention to univariate neural networks; the extension to the multivariate case can be obtained along similar lines, up to a factor depending on the output dimension.
The distribution of neural networks in the large-width limit is a classical topic in learning theory, the first result going back to the seminal work [23]. The subject has gained considerable attention in the machine learning community as it can shed light on the network training process and draw links with kernel-based learning. Neural networks are usually optimized by (variants of) gradient descent with a random initial condition, hence, at initialization, they can be seen as random fields. In many applications, the Gaussian is in fact the distribution of choice, which leads (in the shallow case) to the model considered in (1). On the other hand, taking large, over-parametrized architectures has become an established practice, achieving impressive empirical performance in spite of classical statistical knowledge that would warn against the risks of overfitting [30, 4]. For these reasons – (Gaussian) random initialization and over-parametrization – central limit theorems for neural networks in the infinite-width limit provide useful information on the distribution of a typical network at the beginning of its training. In particular, the Gaussian limit reveals random neural networks as approximations of kernel methods associated with random-features kernels, that is, kernels of the form (2) [27, 2]. Interestingly, some of this information may even carry over beyond initialization. Indeed, it has been observed that, under a proper scaling limit, the evolution of the network through training is well approximated by its linearization around the initial condition, and it is governed by a kernel, called the neural tangent kernel, which adds higher-order correlations to the random features (2) [18, 9]. In this lazy regime, the weights do not move too much from their random initialization, and thus the network stays close to its central limit.
As we already recalled, the story of central limit theorems for neural networks starts with [23], which gave the proof for a single hidden layer. This first result was later generalized to deep networks, leading to an extensive picture in [15]. Non-Gaussian perturbations at finite width have been investigated by studying higher-order cumulants in [28, 29]. Quantitative central limit theorems in suitable probability metrics have been considered only very recently in [3] and [6]. In [3] the authors have proved a finite-dimensional quantitative central limit theorem for neural networks of finite depth whose activation functions satisfy a Lipschitz condition; in [6], the authors have proved second-order Poincaré inequalities (which imply one-dimensional quantitative central limit theorems) for neural networks with ${C^{2}}$ activation functions.
Understanding the Gaussian behavior of a neural network allows, for instance, to investigate the geometry of its landscape, e.g., the cardinality of its minima, the number of nodal components and many other quantities of interest. However, convergence of the finite-dimensional distributions is in general not sufficient to constrain such landscapes. For this reason, functional results, that is, bounds on the speed of convergence in functional spaces, are also of great interest. So far, the literature on quantitative functional central limit theorems is still limited: [13] and [19] have focused on one-hidden-layer networks, where the random coefficients in the inner layer are Gaussian for [13] and uniform on the sphere for [19], whereas the coefficients in the outer layer follow a Rademacher distribution for both. In particular, the authors in [13] manage to establish rates of convergence in Wasserstein distance which are (power of) logarithmic for ReLu and other activation functions, and algebraic for polynomial or very smooth activations, see below for more details. On the other hand, the rates in [19] for ReLu networks are of the form $O({n^{-\frac{1}{2d-1}}})$; this is algebraic for fixed values of d, but it can actually converge to zero more slowly than the inverse of a logarithm if d is of the same order as n, as is the case for many applications.

1.1 Purpose and plan of the paper

We consider in this work functional quantitative central limit theorems under general activations and for coefficients that are Gaussian for both layers, which seems the most relevant case for applications; our approach is largely based upon very recent results by [8] on the Stein–Malliavin techniques for random elements taking values in Hilbert spaces (we refer to [24, 25] for the general foundations of this approach, together with [20, 7, 1, 12] for some more recent references). Our main results are collected in Section 2, whereas their proofs with a few technical lemmas are given in Section 4. A short comparison with the existing literature is provided in Section 3. Appendix A is mainly devoted to background results which we heavily exploit throughout the paper.
Notation. Hereafter, we will write ${a_{n}}\sim {b_{n}}$ for two positive sequences such that ${\lim \nolimits_{n\to \infty }}{a_{n}}/{b_{n}}=1$. The expression $A\lesssim B$ means that $A\le CB$ for some absolute constant $C\gt 0$. We will denote by $\| \cdot \| $ the ${L^{2}}$ norm corresponding to the uniform probability measure on the unit sphere ${\mathbb{S}^{d-1}}$.

2 Main results

In order to state our main theorems, we shall need some further assumptions and notations. We shall always be concerned with activation functions which are square integrable with respect to the standard Gaussian measure, i.e., such that
\[ \mathbb{E}\big[{\sigma ^{2}}(\zeta )\big]\lt \infty ,\hspace{2.5pt}\zeta \sim N(0,1);\]
this is a truly minimal condition, which is guaranteed, for instance, by $\sigma (z)=O(\exp ({z^{2}}/(4+\delta )))$ for some $\delta \gt 0$ (so that ${\sigma ^{2}}(z){e^{-{z^{2}}/2}}$ is integrable). For such activation functions, it is well known that the following Hermite expansion holds, in the ${L^{2}}$ sense with respect to the Gaussian measure (see, e.g., [25]):
\[ \sigma (x)={\sum \limits_{q=0}^{\infty }}{J_{q}}(\sigma )\frac{{H_{q}}(x)}{\sqrt{q!}},\hspace{1em}\text{with}\hspace{2.5pt}{H_{q}}(x):={(-1)^{q}}{e^{\frac{{x^{2}}}{2}}}\frac{{d^{q}}}{d{x^{q}}}{e^{-\frac{{x^{2}}}{2}}},\]
where ${\{{H_{q}}\}_{q=0,1,2,\dots ,}}$ is the well-known sequence of Hermite polynomials. The coefficients ${J_{q}}(\sigma )$, which will play a crucial role in our arguments below, are defined according to the following (normalized) projection:
\[ {J_{q}}(\sigma ):=\frac{1}{\sqrt{q!}}\mathbb{E}\big[\sigma (\zeta ){H_{q}}(\zeta )\big].\]
In the following, when no confusion is possible, we may drop the dependence of J on σ for ease of notation. We remark that our notation is to some extent nonstandard, insofar as we have introduced the factor $\frac{1}{\sqrt{q!}}$ inside the projection coefficient $\mathbb{E}[\sigma (\zeta ){H_{q}}(\zeta )]$; equivalently, we are defining the projection coefficients in terms of Hermite polynomials which have been normalized to have unit variance. Indeed, it is well known that
\[ \mathbb{E}\bigg[{\bigg(\frac{{H_{q}}(\zeta )}{\sqrt{q!}}\bigg)^{2}}\bigg]=\frac{1}{q!}\mathbb{E}\big[{\big({H_{q}}(\zeta )\big)^{2}}\big]=1.\]
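The normalized coefficients ${J_{q}}(\sigma )$ can be approximated numerically. The sketch below is only an illustration: it evaluates the Hermite polynomials through the standard three-term recurrence ${H_{k+1}}(x)=x{H_{k}}(x)-k{H_{k-1}}(x)$ and replaces the Gaussian expectation by a trapezoidal rule on a truncated grid. The function names, the grid parameters and the choice of quadrature are assumptions of this example, not part of the paper.

```python
import math


def hermite(q, x):
    """Probabilists' Hermite polynomial H_q(x) via H_{k+1} = x H_k - k H_{k-1}."""
    if q == 0:
        return 1.0
    h_prev, h = 1.0, x
    for k in range(1, q):
        h_prev, h = h, x * h - k * h_prev
    return h


def gaussian_expectation(f, half_width=10.0, steps=8000):
    """Trapezoidal approximation of E[f(zeta)] for zeta ~ N(0, 1).

    The grid is symmetric and contains 0, so a kink of f at the origin
    (as for the ReLu) falls exactly on a node.
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        z = (i - steps // 2) * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * f(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)


def hermite_coefficient(sigma, q):
    """Normalized projection J_q(sigma) = E[sigma(zeta) H_q(zeta)] / sqrt(q!)."""
    projection = gaussian_expectation(lambda z: sigma(z) * hermite(q, z))
    return projection / math.sqrt(math.factorial(q))


def relu(t):
    return max(t, 0.0)
```

For the ReLu, direct integration gives ${J_{0}}=1/\sqrt{2\pi }$, ${J_{1}}=1/2$, ${J_{2}}=1/(2\sqrt{\pi })$ and ${J_{3}}=0$, which `hermite_coefficient(relu, q)` reproduces up to quadrature error.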
In short, our main results state that a quantitative functional central limit theorem for neural networks built on σ holds, and the rate of convergence depends on the rate of decay of $\{{J_{q}}(\sigma )\}$, as $q\to \infty $; roughly put, it is logarithmic when this rate is polynomial (e.g., the ReLu case), whereas convergence occurs at algebraic rates for some activation functions which are smoother, with exponential decay of the coefficients. A more detailed discussion of these results and comparisons with the existing literature are given below in Section 3.
Let us discuss an important point about normalization. In this paper, the measure on the sphere ${\mathbb{S}^{d-1}}$ is normalized to have unit volume. The bounds we obtain are not invariant to this normalization, and indeed they would be much tighter if the measure on the sphere were taken, as usual, to have total mass ${s_{d}}=\frac{2{\pi ^{d/2}}}{\Gamma (\frac{d}{2})}$, the surface volume of ${\mathbb{S}^{d-1}}$. Indeed, by Stirling’s formula
\[ {s_{d}}=\frac{2{\pi ^{d/2}}}{\Gamma (\frac{d}{2})}\sim \frac{2{\pi ^{d/2}}{2^{d/2}}{e^{d/2}}}{\sqrt{4\pi /d}\hspace{0.1667em}{d^{d/2}}}=\sqrt{\frac{d}{\pi }}{\bigg(\sqrt{\frac{2e\pi }{d}}\bigg)^{d}};\]
${s_{d}}$ achieves its maximum for $d=7$ (${s_{7}}=33.073$) and decays faster than exponentially as $d\to \infty $. This means that, without the normalization that we chose, our bound on the ${d_{2}}$ metric would be actually smaller by a factor of roughly ${d^{-d/2}}$ when the dimension grows. On the other hand, if we were to take the standard Lebesgue measure λ then we would obtain, by a standard application of Hermite expansions and the Diagram Formula
\[ \mathbb{E}\| F{\| _{{L^{2}}(\lambda )}^{2}}=\sum \limits_{q}{J_{q}^{2}}(\sigma ){\int _{{\mathbb{S}^{d-1}}}}\lambda (dx)=\sum \limits_{q}{J_{q}^{2}}(\sigma ){s_{d}},\]
so that the ${L^{2}}$ norm would decay very quickly as d increases, making the interpretation of results less transparent.
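The behavior of the surface volume ${s_{d}}$ discussed above is easy to check numerically. The short sketch below (function and variable names are chosen for this illustration only) evaluates ${s_{d}}=2{\pi ^{d/2}}/\Gamma (d/2)$ for a range of dimensions and locates its maximum.

```python
import math


def sphere_surface(d):
    """Surface volume s_d = 2 pi^{d/2} / Gamma(d/2) of the unit sphere S^{d-1}."""
    return 2.0 * math.pi ** (d / 2.0) / math.gamma(d / 2.0)


# Tabulate s_d for d = 1, ..., 100 and find the dimension where it peaks.
values = {d: sphere_surface(d) for d in range(1, 101)}
d_max = max(values, key=values.get)
```

In this range the maximum is attained at `d_max = 7`, and `values[100]` is already many orders of magnitude below 1, in line with the faster-than-exponential decay.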
Following [8], the convergence in our central limit theorem is measured in the ${d_{2}}$ metric. This is given by
\[ {d_{2}}(F,Z)=\underset{\| h{\| _{{C_{b}^{2}}({L^{2}}({\mathbb{S}^{d-1}}))}}\le 1}{\sup }\big|\mathbb{E}h(F)-\mathbb{E}h(Z)\big|\hspace{2.5pt},\]
where ${C_{b}^{2}}({L^{2}}({\mathbb{S}^{d-1}}))$ is the space of real-valued functions on ${L^{2}}({\mathbb{S}^{d-1}})$ (where ${L^{2}}$ is taken with respect to the uniform measure) with bounded Fréchet derivatives up to order 2. It is to be noted that the ${d_{2}}$ metric is bounded by the Wasserstein distance of order 2, i.e.
\[ {d_{2}}(F,Z)\le {\mathcal{W}_{2}}(F,Z):=\underset{(\widetilde{F},\widetilde{Z})}{\inf }{\big(\mathbb{E}\| \widetilde{F}-\widetilde{Z}{\| _{{L^{2}}({\mathbb{S}^{d-1}})}^{2}}\big)^{1/2}},\]
where the infimum is taken over all the possible couplings of $(F,Z)$.
Our first main statement is as follows.
Theorem 1.
Under the previous assumptions and notations, and letting Z be the Gaussian process with zero mean and same covariance as F, we have that, for all $Q\le {\log _{3}}\sqrt{n}$,
(3)
\[ {d_{2}}(F,Z)\le C\| \sigma \| \frac{1}{\sqrt[4]{n}}\sqrt{{\sum \limits_{q=0}^{Q}}{J_{q}^{2}}(\sigma )q{3^{q}}}+\frac{3}{2}\sqrt{{\sum \limits_{q=Q+1}^{\infty }}{J_{q}^{2}}(\sigma )},\]
where C is an absolute constant (in particular, independent of the input dimension d), and $\| \sigma \| $ is the ${L^{2}}$ norm of σ taken with respect to the Gaussian density on $\mathbb{R}$.
The proof is postponed to Section 4.1. From Theorem 1, optimizing over the choice of Q, it is immediate to obtain much more explicit bounds. In the case of polynomial decay of the Hermite coefficients, the choice $Q=\log n/(3\log 3)$ yields the following result.
Corollary 2.
In the same setting as in Theorem 1, for ${J_{q}}(\sigma )\lesssim {q^{-\alpha }}$, $\alpha \gt \frac{1}{2}$, we have
\[ {d_{2}}(F,Z)\le C\| \sigma \| \frac{1}{{(\log n)^{\alpha -\frac{1}{2}}}}.\]
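The following sketch indicates how the stated choice of Q turns (3) into a logarithmic rate. It is only a heuristic outline in which constants depending on α are not tracked, not a replacement for the proof. With $Q=\log n/(3\log 3)$ we have ${3^{Q}}={n^{1/3}}$. Since ${J_{q}^{2}}\lesssim {q^{-2\alpha }}$ and the terms ${q^{1-2\alpha }}{3^{q}}$ grow geometrically, the first sum in (3) is of the order of its last term, up to a constant depending only on α:
\[ {\sum \limits_{q=0}^{Q}}{J_{q}^{2}}(\sigma )q{3^{q}}\lesssim {Q^{1-2\alpha }}{3^{Q}}={Q^{1-2\alpha }}{n^{1/3}},\hspace{1em}\text{so that}\hspace{1em}\frac{1}{\sqrt[4]{n}}\sqrt{{\sum \limits_{q=0}^{Q}}{J_{q}^{2}}(\sigma )q{3^{q}}}\lesssim {n^{-\frac{1}{12}}}{Q^{\frac{1}{2}-\alpha }}.\]
For the tail, comparison with an integral gives
\[ {\sum \limits_{q=Q+1}^{\infty }}{J_{q}^{2}}(\sigma )\lesssim {\int _{Q}^{\infty }}{t^{-2\alpha }}dt=\frac{{Q^{1-2\alpha }}}{2\alpha -1},\hspace{1em}\text{so that}\hspace{1em}\sqrt{{\sum \limits_{q=Q+1}^{\infty }}{J_{q}^{2}}(\sigma )}\lesssim {Q^{\frac{1}{2}-\alpha }}.\]
Since Q is proportional to $\log n$, the tail term dominates and both contributions are bounded by a multiple of ${(\log n)^{-(\alpha -\frac{1}{2})}}$.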
Example 3 (ReLu).
As shown in Lemma 19, for the ReLu activation $\sigma (t)=t{\mathbb{I}_{[0,\infty )}}(t)$ we have that ${J_{q}}(\sigma )\lesssim {q^{-\frac{5}{4}}}$; applying Corollary 2 with $\alpha =\frac{5}{4}$, we obtain the bound ${d_{2}}(F,Z)\lesssim {(\log n)^{-\frac{3}{4}}}$. Once again, we stress that the constant is independent of the input dimension d.
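To get a feeling for the trade-off in the choice of Q, one can evaluate the two terms on the right-hand side of (3) numerically. The sketch below is a toy computation under explicit assumptions: the unknown constants are set to $C=\| \sigma \| =1$, the coefficients are taken to be exactly ${J_{q}}={q^{-\alpha }}$ for $q\ge 1$ (the $q=0$ term does not contribute to the first sum because of the factor q), and the infinite tail is truncated at a hypothetical cutoff.

```python
import math


def bound_terms(n, Q, alpha, tail_cutoff=20000):
    """Both terms of (3) with C = ||sigma|| = 1 and J_q = q^{-alpha} (assumed).

    Returns (head, tail): head is n^{-1/4} sqrt(sum_{q<=Q} J_q^2 q 3^q),
    tail is (3/2) sqrt(sum_{q>Q} J_q^2), truncated at tail_cutoff.
    """
    head_sum = sum(q ** (1.0 - 2.0 * alpha) * 3.0 ** q for q in range(1, Q + 1))
    tail_sum = sum(q ** (-2.0 * alpha) for q in range(Q + 1, tail_cutoff))
    return n ** -0.25 * math.sqrt(head_sum), 1.5 * math.sqrt(tail_sum)


def best_bound(n, alpha):
    """Minimize head + tail over the admissible range 1 <= Q <= log_3 sqrt(n)."""
    q_max = int(math.log(math.sqrt(n), 3))
    return min(sum(bound_terms(n, Q, alpha)) for Q in range(1, q_max + 1))
```

For instance, comparing `best_bound(10**6, 1.25)` with `best_bound(10**12, 1.25)` shows how slowly the optimized bound improves with the width when the coefficients decay only polynomially.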
The statement of Theorem 1 is given in order to cover the most general activation functions, allowing for possibly nondifferentiable choices such as the ReLu. Under stronger conditions, the result can be improved; in particular, assuming the activation function has a Malliavin derivative with bounded fourth moment (i.e., it belongs to the class ${\mathbb{D}^{1,4}}$, see [25, 8]), we obtain the following extension.
Theorem 4.
Under the previous assumptions and notations, and assuming furthermore that $\sigma (Wx)\in {\mathbb{D}^{1,4}}$, we have that, for all $Q\in \mathbb{N}$,
(4)
\[ {d_{2}}(F,Z)\le C\frac{1}{\sqrt{n}}{\sum \limits_{q=0}^{Q}}{J_{q}^{2}}(\sigma )q{3^{q}}\Bigg(\| \sigma {\| ^{2}}+\frac{1}{\sqrt{n}}{\sum \limits_{q=0}^{Q}}{J_{q}^{2}}(\sigma ){3^{q}}\Bigg)+\frac{3}{2}\sqrt{{\sum \limits_{q=Q+1}^{\infty }}{J_{q}^{2}}(\sigma )},\]
where C is an absolute constant (in particular, independent of the input dimension d), and $\| \sigma \| $ is the ${L^{2}}$ norm of σ taken with respect to the Gaussian density on $\mathbb{R}$.
We prove Theorem 4 in Section 4.4. Again, imposing specific decay profiles on the Hermite expansion we can obtain explicit bounds. In particular, when ${J_{q}}\lesssim {e^{-\beta q}}$ with $\beta \gt \log \sqrt{3}$, the second sum appearing in (4) stays finite for all Q, hence the bound assumes the form
\[ {d_{2}}(F,Z)\le C\| \sigma {\| ^{2}}\frac{1}{\sqrt{n}}{\sum \limits_{q=0}^{Q}}{J_{q}^{2}}(\sigma )q{3^{q}}+\frac{3}{2}\sqrt{{\sum \limits_{q=Q+1}^{\infty }}{J_{q}^{2}}(\sigma )},\]
which is more in line with the bound (3). In such a case, letting Q go to infinity leads to the next result.
Corollary 5.
In the same setting as in Theorem 4, for ${J_{q}}(\sigma )\lesssim {e^{-\beta q}}$, $\beta \gt \log \sqrt{3}$, we have
\[ {d_{2}}(F,Z)\le C\frac{1}{\sqrt{n}}.\]
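The role of the threshold $\beta \gt \log \sqrt{3}$ can be made explicit with a geometric series; the computation below is a short sketch with constants not tracked. Setting $r:=3{e^{-2\beta }}$, the assumption on β is equivalent to $r\lt 1$, and
\[ {\sum \limits_{q=0}^{\infty }}{J_{q}^{2}}(\sigma )q{3^{q}}\lesssim {\sum \limits_{q=0}^{\infty }}q{r^{q}}=\frac{r}{{(1-r)^{2}}}\lt \infty ,\hspace{2em}{\sum \limits_{q=0}^{\infty }}{J_{q}^{2}}(\sigma ){3^{q}}\lesssim {\sum \limits_{q=0}^{\infty }}{r^{q}}=\frac{1}{1-r}\lt \infty .\]
Hence the first term on the right-hand side of (4) is bounded by a multiple of ${n^{-1/2}}$ uniformly in Q, while the tail $\sqrt{{\textstyle\sum _{q\gt Q}}{J_{q}^{2}}(\sigma )}$ vanishes as $Q\to \infty $.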
Example 6 (polynomials/erf).
The assumptions of Corollary 5 are fulfilled by polynomial activations and by the error function $\operatorname{erf}(t)=\frac{2}{\sqrt{\pi }}{\textstyle\int _{0}^{t}}{e^{-{s^{2}}}}ds$, for which ${J_{q}^{2}}(\sigma )\lesssim {(2/3)^{q}}$ – cf. [19]. In these cases, the fact that $\sigma (Wx)\in {\mathbb{D}^{1,4}}$ can be readily shown by means of the triangle inequality and the standard hypercontractivity bound for Wiener chaos components – see [25, Corollary 2.8.14].
Example 7 (tanh/logistic).
Of course, other forms of decay could be considered. For instance, for the hyperbolic tangent $\sigma (t)=({e^{t}}-{e^{-t}})/({e^{t}}+{e^{-t}})$ the rate of decay of the Hermite coefficients is of order $\exp (-C\sqrt{q})$ (see, e.g., [13]), hence the result of Corollary 5 does not apply; the bounds in Corollary 2 obviously hold, but applying directly Theorem 1 and some algebra we obtain the finer bound
\[ {d_{2}}(F,Z)\lesssim \exp (-c\sqrt{\log n}),\hspace{1em}\text{for}\hspace{2.5pt}{J_{q}}(\sigma )\le \exp (-C\sqrt{q}).\]
The same bound holds also for the sigmoid/logistic activation function $\sigma (t)={(1+{e^{-t}})^{-1}}$.
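One possible way to carry out the algebra behind this bound is sketched below; it is an outline under the assumption ${J_{q}}(\sigma )\le \exp (-C\sqrt{q})$, with constants not optimized. Take Q to be the integer part of $\log n/(3\log 3)$, so that ${3^{Q}}\le {n^{1/3}}$. Using ${J_{q}}\le 1$ in the first sum of (3),
\[ \frac{1}{\sqrt[4]{n}}\sqrt{{\sum \limits_{q=0}^{Q}}{J_{q}^{2}}(\sigma )q{3^{q}}}\le \frac{1}{\sqrt[4]{n}}\sqrt{{Q^{2}}{3^{Q}}}\le Q\hspace{0.1667em}{n^{-\frac{1}{12}}},\]
while, by comparison with an integral and the substitution $u=\sqrt{t}$,
\[ {\sum \limits_{q=Q+1}^{\infty }}{e^{-2C\sqrt{q}}}\le {\int _{Q}^{\infty }}{e^{-2C\sqrt{t}}}dt={e^{-2C\sqrt{Q}}}\bigg(\frac{\sqrt{Q}}{C}+\frac{1}{2{C^{2}}}\bigg).\]
Taking square roots, the tail term is of order ${Q^{1/4}}{e^{-C\sqrt{Q}}}$. Since $\sqrt{Q}$ is proportional to $\sqrt{\log n}$, both contributions are bounded by $\exp (-c\sqrt{\log n})$ for a suitable constant $c\gt 0$ smaller than $C/\sqrt{3\log 3}$.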
Remark 8.
Lower bounds on the rates of convergence of neural networks to Gaussian processes are still an open question. In particular, we do not know whether the rates obtained in Corollary 2 and Corollary 5 are optimal. In Section 3 we compare our results to the previous literature.

2.1 Sketch of the proof and discussion

As a first step in our proof, cf. Section 4.1, we decompose our neural network F into two processes: ${F_{\le Q}}$, corresponding to its projection onto the first Q Wiener chaoses, and ${F_{\gt Q}}$, the remainder, where Q is an integer to be chosen below. This truncation-and-optimization approach is rather standard in the literature on Quantitative Central Limit Theorems, cf. [8, Remark 3.11]. By the triangle inequality, we can bound the distance of F from a suitable Gaussian process Z with the distance of Z from ${F_{\le Q}}$ plus the 2-Wasserstein distance of F from ${F_{\le Q}}$. This second part can be easily bounded by standard ${L^{2}}$ arguments, see (5).
For the leading term we follow a recent result by Bourguin and Campese [8], which we restate as Theorem 16, adapted to our framework. To the best of our knowledge, this is the first time that a link between the Stein–Malliavin method (see [25, 8] and the references therein) and neural networks has been established.
Thanks to this technique, the problem can be essentially reduced to a thorough analysis of fourth-order cumulants and covariances for the ${L^{2}}$ norms of the Wiener projections. Besides smaller order terms, we dominate the distance between Z and ${F_{\le Q}}$ with the sum of two terms, called M and C. Heuristically, M controls the expected distance of the fourth moments of the Wiener projections of Z and ${F_{\le Q}}$, while C accounts for the covariances between different projections of ${F_{\le Q}}$. In order to control M, in Proposition 9 we exploit the properties of Hermite polynomials and in particular the diagram formula (see [22, Proposition 4.15]). A detailed analysis of the possible configurations of the diagrams (Lemma 10 and Lemma 11) and of the covariances of the Hermite polynomials (Lemma 12) allows us to obtain bounds that, in particular, do not depend on the dimension d of the input, cf. Remark 13 and the discussion in Section 3. Finally, in Proposition 15 we show that C is bounded from above by M itself.
We point out that the strategy we follow relies on the Gaussianity of the distribution of the weights W. While nonquantitative versions of the CLT have been proved assuming only mild finite moment assumptions, see [15], Gaussianity has been required in the literature for the quantitative case so far (cf. [13, 19]). Our main technical tool, [8, Theorem 3.10], goes beyond the Gaussian case and Hermite polynomials, but for more general eigenfunctions no diagram formula is known and explicit computations (e.g., for estimating the cumulants) become impossible.
Another technical point that we shall address is the following. The convergence results by [8] require the limiting process to be nondegenerate; this condition is not always satisfied for arbitrary activation functions if one takes the corresponding Hilbert space to be ${L^{2}}({\mathbb{S}^{d-1}})$ (counter-examples being finite-order polynomials). However, we note that the nondegeneracy condition is automatically satisfied for activations for which the corresponding networks are dense in the space of continuous functions (such as the ReLu or the sigmoid and basically all nonpolynomials, see for instance the classical universal approximation theorems in [10, 16, 17, 21, 26]). On the other hand, when the condition fails, our results continue to hold, but the underlying functional space must be taken to be the reproducing kernel Hilbert space generated by the covariance operator, which is strictly included in ${L^{2}}({\mathbb{S}^{d-1}})$ when universal approximation fails (e.g., in the polynomial case).

3 A comparison with the existing literature

Two papers that have established quantitative functional central limit theorems for neural networks are [13] and [19]. Their settings and results are not entirely comparable to ours; on the one hand, they use the Wasserstein distance, which is slightly stronger than the ${d_{2}}$ metric we consider here. On the other hand, their model for the random weights is different from ours: for the outer layers, both consider Rademacher variables, while for the inner layer the distribution is Gaussian in [13] and uniform on the sphere in [19]; on the contrary, we assume the Gaussian distribution for both inner and outer layer. As a further (minor) difference, we note that in [13], as well as in our paper, input variables are in ${\mathbb{S}^{d-1}}$, while [19] considers $\sqrt{d}\hspace{0.2778em}{\mathbb{S}^{d-1}}$; this is just a notational issue, though, because in [19] the argument of the activation function is normalized by a factor $1/\sqrt{d}$.
Even with these important caveats, it is nevertheless of some interest to compare their bounds with ours, for activation functions for which there is an overlap. We report their results together with ours in Table 1 (the constant C may differ from one box to the other, but in all cases it depends neither on d nor on n).
Table 1. Comparison of convergence rates established by different functional quantitative central limit theorems for several activation functions. Bear in mind that two different metrics ${d_{2}}\le {\mathcal{W}_{2}}$ are considered: ${\mathcal{W}_{2}}$ for [13, 19], and ${d_{2}}$ for this paper. The parameters α and β must satisfy $\alpha \gt 1/2$ and $\beta \gt \log \sqrt{3}$.

| Activation | Eldan et al. [13] | Klukowski [19] | This paper |
| --- | --- | --- | --- |
| ${J_{q}}\sim {q^{-\alpha }}$ | ${(\frac{\log n}{\log \log n\log d})^{-\alpha +\frac{1}{2}}}$ | – | ${(\log n)^{-\alpha +\frac{1}{2}}}$ |
| ReLu | ${(\frac{\log n}{\log \log n\log d})^{-\frac{3}{4}}}$ | ${n^{-\frac{3}{4d-2}}}$ | ${(\log n)^{-\frac{3}{4}}}$ |
| tanh / logistic | $\exp (-c\sqrt{\frac{\log n}{\log d\log \log n}})$ | – | $\exp (-c\sqrt{\log n})$ |
| ${J_{q}}\sim {e^{-\beta q}}$ | ${n^{-c{(\log \log n\log d)^{-1}}}}$ | – | ${n^{-\frac{1}{2}}}$ |
| erf | ${n^{-c{(\log \log n\log d)^{-1}}}}$ | ${C^{d}}{(\log n)^{\frac{d}{2}-1}}{n^{-\frac{1}{2}}}$ | ${n^{-\frac{1}{2}}}$ |
| polynomial order p | ${p^{cp}}{d^{\frac{5p}{6}-\frac{1}{12}}}{n^{-\frac{1}{6}}}$ | ${(d+p)^{\frac{d}{2}}}{n^{-\frac{1}{2}}}$ | ${n^{-\frac{1}{2}}}$ |
Comparing to [13], our bounds remove a logarithmic factor in the input dimension and a $\log \log $ factor in the number of neurons for ReLu and tanh networks; for smooth activations, the rate goes from ${n^{-1/6}}$ to ${n^{-1/2}}$, and the constants lose the polynomial dependence on the dimension. The rate in [19] in the polynomial case is ${n^{-1/2}}$ as ours, but with a factor growing in the input dimension d as ${d^{d/2}}$. In the ReLu setting, [19] displays the algebraic rate ${n^{-\frac{3}{4d-2}}}$, which for fixed values of d decays faster than our logarithmic bound. However, interpretation of these bounds from a “fixed d, growing n” perspective can be incomplete: when considering distances in probability metrics it is of interest to allow both d and n to vary. In particular, for neural network applications, it is often the case that the input dimension and number of neurons are of comparable order; taking for instance $d={d_{n}}\sim {n^{\alpha }}$, it is immediate to verify that for all $\alpha \gt 0$ (no matter how small) one has
\[ \underset{n\to \infty }{\lim }\frac{{(\log n)^{-\frac{3}{4}}}}{{n^{-\frac{3}{4d-2}}}}=\underset{n\to \infty }{\lim }\frac{{(\log n)^{-\frac{3}{4}}}}{\exp (-\frac{3}{4{n^{\alpha }}-2}\log n)}=0,\]
so that our bound in the ${d_{2}}$ metric decays faster than the one by [19] in ${\mathcal{W}_{2}}$ under these circumstances.
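The comparison between the two ReLu rates can be explored numerically. The sketch below simply evaluates the two expressions from Table 1 with all constants set to 1, which is an assumption of this illustration; the function names are likewise hypothetical.

```python
import math


def rate_this_paper(n):
    """ReLu rate (log n)^{-3/4} from Table 1, constant set to 1."""
    return math.log(n) ** -0.75


def rate_klukowski(n, d):
    """ReLu rate n^{-3/(4d-2)} from Table 1, constant set to 1."""
    return n ** (-3.0 / (4.0 * d - 2.0))


def ratio(n, alpha):
    """Ratio of the two rates when the input dimension grows as d = n^alpha."""
    return rate_this_paper(n) / rate_klukowski(n, n ** alpha)
```

With `d = 2` fixed, `rate_klukowski(10**6, 2)` is far smaller than `rate_this_paper(10**6)`, whereas `ratio(n, 0.5)` decreases as n grows, which is the regime described by the limit above.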

4 Proof of the main results

Our main results, Theorems 1 and 4, are proved in Sections 4.1 and 4.4, respectively. The proofs use auxiliary propositions and lemmas, which are established in Sections 4.2 and 4.3.

4.1 Proof of Theorem 1

The main idea behind our proof is as follows. For some integer Q to be fixed later, write
\[ F={F_{\le Q}}+{F_{\gt Q}},\]
where
\[ {F_{\le Q}}:={\sum \limits_{q=0}^{Q}}{F_{q}},\hspace{2em}{F_{\gt Q}}:={\sum \limits_{q=Q+1}^{\infty }}{F_{q}},\]
and
\[ {F_{q}}(x):=\frac{{J_{q}}(\sigma )}{\sqrt{n}}{\sum \limits_{j=1}^{n}}{V_{j}}\frac{{H_{q}}({W_{j}}x)}{\sqrt{q!}},\hspace{14.22636pt}x\in {\mathbb{S}^{d-1}}.\]
In words, as anticipated in Section 2.1, we are partitioning our network into a component projected onto the Q lowest Wiener chaoses and the remainder projection on the highest chaoses. Now recall that Z is the zero mean Gaussian process with covariance function
\[ \mathbb{E}\big[Z({x_{1}})Z({x_{2}})\big]:=S({x_{1}},{x_{2}})=\mathbb{E}\big[F({x_{1}})F({x_{2}})\big]={\sum \limits_{q=0}^{\infty }}{J_{q}^{2}}(\sigma ){\langle {x_{1}},{x_{2}}\rangle ^{q}}.\]
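The series representation of the covariance can be checked numerically in the ReLu case. The sketch below compares a truncated sum ${\textstyle\sum _{q\le Q}}{J_{q}^{2}}{\rho ^{q}}$, with $\rho =\langle {x_{1}},{x_{2}}\rangle $, against the classical closed form $\mathbb{E}[\sigma (W{x_{1}})\sigma (W{x_{2}})]=\frac{1}{2\pi }(\sin \theta +(\pi -\theta )\cos \theta )$, $\theta =\arccos \rho $. This closed form is a standard identity for the ReLu that is not stated in the paper; the quadrature, the truncation level and all function names are assumptions of this example.

```python
import math


def hermite(q, x):
    """Probabilists' Hermite polynomial H_q(x) via H_{k+1} = x H_k - k H_{k-1}."""
    if q == 0:
        return 1.0
    h_prev, h = 1.0, x
    for k in range(1, q):
        h_prev, h = h, x * h - k * h_prev
    return h


def gaussian_expectation(f, half_width=10.0, steps=8000):
    """Trapezoidal approximation of E[f(zeta)], zeta ~ N(0, 1), on a symmetric grid."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        z = (i - steps // 2) * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * f(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)


def relu(t):
    return max(t, 0.0)


def series_covariance(rho, Q=10):
    """Truncated series sum_{q <= Q} J_q(relu)^2 rho^q."""
    total = 0.0
    for q in range(Q + 1):
        projection = gaussian_expectation(lambda z: relu(z) * hermite(q, z))
        J_q = projection / math.sqrt(math.factorial(q))
        total += J_q ** 2 * rho ** q
    return total


def relu_covariance_closed_form(rho):
    """Closed form of E[relu(W x1) relu(W x2)] for unit vectors with <x1, x2> = rho."""
    theta = math.acos(rho)
    return (math.sin(theta) + (math.pi - theta) * rho) / (2.0 * math.pi)
```

For $\rho =1/2$ the two functions agree to several decimal places already with $Q=10$, since the neglected terms carry the factor ${\rho ^{q}}$.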
In the sequel we shall write ${\{{Z_{q}}\}_{q\in \mathbb{N}}}$ for a sequence of independent zero mean Gaussian random fields with covariance function $\mathbb{E}[{Z_{q}}({x_{1}}){Z_{q}}({x_{2}})]:={J_{q}^{2}}(\sigma ){\langle {x_{1}},{x_{2}}\rangle ^{q}}$. Our idea is to use Theorem 3.10 in [8] and hence to consider
\[\begin{aligned}{}{d_{2}}(F,Z)& \le {d_{2}}({F_{\le Q}},Z)+{d_{2}}(F,{F_{\le Q}})\\ {} & \le \frac{1}{2}\big(\sqrt{M({F_{\le Q}})+C({F_{\le Q}})}+\| S-{S_{\le Q}}{\| _{{L^{2}}(\Omega ,\mathrm{HS})}}\big)+{\mathcal{W}_{2}}(F,{F_{\le Q}}),\end{aligned}\]
where
\[\begin{aligned}{}M({F_{\le Q}})& :=\frac{1}{\sqrt{3}}{\sum \limits_{p,q}^{Q}}{c_{p,q}}\sqrt{\mathbb{E}\| {F_{p}}{\| ^{4}}\big(\mathbb{E}\| {F_{q}}{\| ^{4}}-\mathbb{E}\| {Z_{q}}{\| ^{4}}\big)},\\ {} C({F_{\le Q}})& :={\sum \limits_{\begin{array}{c}p,q\\ {} p\ne q\end{array}}^{Q}}{c_{p,q}}\operatorname{Cov}\big(\| {F_{p}}{\| ^{2}},\| {F_{q}}{\| ^{2}}\big),\\ {} {c_{p,q}}& :=\left\{\begin{array}{l@{\hskip10.0pt}l}1+\sqrt{3},\hspace{1em}& p=q,\\ {} \frac{p+q}{2p},\hspace{1em}& p\ne q,\end{array}\right.\end{aligned}\]
and we have
(5)
\[ {\mathcal{W}_{2}}(F,{F_{\le Q}})\le \sqrt{{\sum \limits_{q=Q+1}^{\infty }}{J_{q}^{2}}(\sigma )}.\]
Moreover,
\[ \| S-{S_{\le Q}}{\| _{{L^{2}}(\Omega ,\mathrm{HS})}^{2}}\le {\sum \limits_{q=Q+1}^{\infty }}{J_{q}^{2}}(\sigma );\]
indeed, first note that the covariance operator can be written explicitly in coordinates as
\[\begin{aligned}{}& {S_{\le Q}}({x_{1}},{x_{2}})\\ {} & \hspace{1em}=\frac{1}{n}{\sum \limits_{p,q}^{Q}}{J_{p}}(\sigma ){J_{q}}(\sigma )\frac{1}{\sqrt{p!\hspace{0.1667em}q!}}{\sum \limits_{{j_{1}},{j_{2}}=1}^{n}}\mathbb{E}\big[\big\{{V_{{j_{1}}}}{H_{p}}({W_{{j_{1}}}}{x_{1}})\big\}\big\{{V_{{j_{2}}}}{H_{q}}({W_{{j_{2}}}}{x_{2}})\big\}\big]\\ {} & \hspace{1em}={\sum \limits_{q}^{Q}}{J_{q}^{2}}(\sigma ){\langle {x_{1}},{x_{2}}\rangle ^{q}},\end{aligned}\]
and hence
\[ S(x,y)-{S_{\le Q}}(x,y)={\sum \limits_{q=Q+1}^{\infty }}{J_{q}^{2}}(\sigma ){\langle x,y\rangle ^{q}}.\]
Therefore, taking the standard basis of spherical harmonics $\{{Y_{\ell m}}\}$, which are eigenfunctions of the covariance operators (see [22]),
\[\begin{aligned}{}& \| S-{S_{\le Q}}{\| _{{L^{2}}(\Omega ,\mathrm{HS})}^{2}}\\ {} & \hspace{1em}=\sum \limits_{\ell ,{\ell ^{\prime }},m,{m^{\prime }}}{\sum \limits_{q=Q+1}^{\infty }}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{Y_{\ell m}}(x){Y_{{\ell ^{\prime }}{m^{\prime }}}}(y)\sum \limits_{{\ell ^{\prime\prime }}{m^{\prime\prime }}}{C_{{\ell ^{\prime\prime }}}}(q){Y_{{\ell ^{\prime\prime }}{m^{\prime\prime }}}}(x){Y_{{\ell ^{\prime\prime }}{m^{\prime\prime }}}}(y)dxdy\\ {} & \hspace{1em}=\sum \limits_{\ell ,{\ell ^{\prime }},m,{m^{\prime }}}{\sum \limits_{q=Q+1}^{\infty }}\sum \limits_{{\ell ^{\prime\prime }}{m^{\prime\prime }}}{C_{{\ell ^{\prime\prime }}}}(q){\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{Y_{\ell m}}(x){Y_{{\ell ^{\prime }}{m^{\prime }}}}(y){Y_{{\ell ^{\prime\prime }}{m^{\prime\prime }}}}(x){Y_{{\ell ^{\prime\prime }}{m^{\prime\prime }}}}(y)dxdy\\ {} & \hspace{1em}=\sum \limits_{\ell ,{\ell ^{\prime }},m,{m^{\prime }}}{\sum \limits_{q=Q+1}^{\infty }}\sum \limits_{{\ell ^{\prime\prime }}{m^{\prime\prime }}}{C_{{\ell ^{\prime\prime }}}}(q){\delta _{\ell }^{{\ell ^{\prime\prime }}}}{\delta _{{\ell ^{\prime }}}^{{\ell ^{\prime\prime }}}}{\delta _{m}^{{m^{\prime\prime }}}}{\delta _{{m^{\prime }}}^{{m^{\prime\prime }}}}\\ {} & \hspace{1em}=\sum \limits_{\ell }{\sum \limits_{q=Q+1}^{\infty }}{C_{\ell }}(q){n_{\ell ;d}}\\ {} & \hspace{1em}={\sum \limits_{q=Q+1}^{\infty }}{J_{q}^{2}}(\sigma ),\end{aligned}\]
where ${n_{\ell ;d}}$ is the dimension of the ℓ-th eigenspace in dimension d and $\{{C_{\ell }}(q)\}$ is the angular power spectrum of ${F_{q}}$, see again [22] for more discussion and details (the discussion in this reference is restricted to $d=2$, but the results can be extended to any dimension).
We are left to bound $M({F_{\le Q}})$ and $C({F_{\le Q}})$. In Section 4.2 we will provide a bound for $M({F_{\le Q}})$. Under the condition $Q\le {\log _{3}}\sqrt{n}$, this bound reduces to
\[ M({F_{\le Q}})\lesssim \frac{\| \sigma {\| ^{2}}}{\sqrt{n}}{\sum \limits_{q}^{Q}}{J_{q}^{2}}(\sigma )q{3^{q}}.\]
On the other hand, in Section 4.3 we will show that
\[ C({F_{\le Q}})\le M({F_{\le Q}}).\]
This completes the proof.

4.2 Bounding $M({F_{\le Q}})$

The following proposition provides a bound on $M({F_{\le Q}})$. The proof relies on several technical lemmas, which are given below.
Proposition 9.
We have
\[ M({F_{\le Q}})\lesssim \frac{1}{\sqrt{n}}{\sum \limits_{q=0}^{Q}}{J_{q}^{2}}q{3^{q}}\Bigg(\| \sigma {\| ^{2}}+\frac{1}{\sqrt{n}}{\sum \limits_{q=0}^{Q}}{J_{q}^{2}}{3^{q}}\Bigg).\]
Proof.
We can write
\[\begin{aligned}{}M({F_{\le Q}})& =\frac{1}{\sqrt{3}}{\sum \limits_{p,q}^{Q}}{c_{p,q}}\sqrt{\mathbb{E}\| {F_{p}}{\| ^{4}}\big(\mathbb{E}\| {F_{q}}{\| ^{4}}-\mathbb{E}\| {Z_{q}}{\| ^{4}}\big)}\\ {} & \le {\sum \limits_{p}^{Q}}\sqrt{\mathbb{E}\| {F_{p}}{\| ^{4}}}{\sum \limits_{q}^{Q}}q\sqrt{\mathbb{E}\| {F_{q}}{\| ^{4}}-\mathbb{E}\| {Z_{q}}{\| ^{4}}}.\end{aligned}\]
In Lemma 10 we compute
\[ \mathbb{E}\| {F_{q}}{\| ^{4}}-\mathbb{E}\| {Z_{q}}{\| ^{4}}=\frac{1}{n}\frac{{J_{q}^{4}}(\sigma )}{{(q!)^{2}}}{\sum \limits_{{q_{1}}=0}^{q-1}}{\Upsilon _{{q_{1}},q}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2(q-{q_{1}})}}d{x_{1}}d{x_{2}},\]
with
\[ {\Upsilon _{{q_{1}},q}}={\left(\genfrac{}{}{0.0pt}{}{q}{{q_{1}}}\right)^{4}}{({q_{1}}!)^{2}}(2q-2{q_{1}})!.\]
By Lemma 11 we get the bound
\[ \underset{0\le {q_{1}}\le q-1}{\max }{\Upsilon _{{q_{1}},q}}\lesssim \frac{{(q!)^{2}}{3^{2q}}}{q},\]
whereas Lemma 12 yields
\[ {\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2(q-{q_{1}})}}d{x_{1}}d{x_{2}}\le 1.\]
Therefore,
\[ \mathbb{E}\| {F_{q}}{\| ^{4}}-\mathbb{E}\| {Z_{q}}{\| ^{4}}\lesssim \frac{{J_{q}^{4}}(\sigma ){3^{2q}}}{n}.\]
Moreover, in view of Lemma 14, we have
\[ \mathbb{E}\| {F_{p}}{\| ^{4}}\lesssim \frac{{J_{p}^{4}}(\sigma ){3^{2p}}}{n}+3{J_{p}^{4}}(\sigma ).\]
Collecting all the terms, we finally obtain the claim.  □
In the following, we collect the technical lemmas used in the proof of Proposition 9.
Lemma 10.
We have
\[ \mathbb{E}\| {F_{q}}{\| ^{4}}-\mathbb{E}\| {Z_{q}}{\| ^{4}}=\frac{1}{n}\frac{{J_{q}^{4}}(\sigma )}{{(q!)^{2}}}{\sum \limits_{{q_{1}}=0}^{q-1}}{\Upsilon _{{q_{1}},q}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2(q-{q_{1}})}}d{x_{1}}d{x_{2}}\]
with ${\Upsilon _{{q_{1}},q}}={\left(\genfrac{}{}{0.0pt}{}{q}{{q_{1}}}\right)^{4}}{({q_{1}}!)^{2}}(2q-2{q_{1}})!$.
Proof.
We will write $\operatorname{Cum}(\cdot ,\cdot ,\cdot ,\cdot )$ for the joint cumulant of four random variables, that is,
\[ \operatorname{Cum}(X,Y,Z,W)=\mathbb{E}[XYZW]-\mathbb{E}[XY]\mathbb{E}[WZ]-\mathbb{E}[XZ]\mathbb{E}[WY]-\mathbb{E}[XW]\mathbb{E}[ZY].\]
We have
\[\begin{aligned}{}\mathbb{E}\| {F_{q}}{\| ^{4}}& =\frac{1}{{n^{2}}}\frac{{J_{q}^{4}}}{{(q!)^{2}}}{\sum \limits_{{j_{1}},{j_{2}},{j_{3}},{j_{4}}=1}^{n}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\\ {} & \hspace{0.2778em}\hspace{0.2778em}\hspace{0.2778em}\times \mathbb{E}\big\{{V_{{j_{1}}}}{H_{q}}({W_{{j_{1}}}}{x_{1}}){V_{{j_{2}}}}{H_{q}}({W_{{j_{2}}}}{x_{1}}){V_{{j_{3}}}}{H_{q}}({W_{{j_{3}}}}{x_{2}}){V_{{j_{4}}}}{H_{q}}({W_{{j_{4}}}}{x_{2}})\big\}d{x_{1}}d{x_{2}}\\ {} & =\frac{1}{n}\frac{{J_{q}^{4}}}{{(q!)^{2}}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\\ {} & \hspace{0.2778em}\hspace{0.2778em}\hspace{0.2778em}\times \operatorname{Cum}\big\{{V_{j}}{H_{q}}({W_{j}}{x_{1}}),{V_{j}}{H_{q}}({W_{j}}{x_{1}}),{V_{j}}{H_{q}}({W_{j}}{x_{2}}),{V_{j}}{H_{q}}({W_{j}}{x_{2}})\big\}d{x_{1}}d{x_{2}}\\ {} & \hspace{0.2778em}\hspace{0.2778em}\hspace{0.2778em}+\frac{{J_{q}^{4}}}{{(q!)^{2}}}{\bigg\{{\int _{{\mathbb{S}^{d-1}}}}\mathbb{E}\big\{{V_{j}}{H_{q}}({W_{j}}{x_{1}}){V_{j}}{H_{q}}({W_{j}}{x_{1}})\big\}d{x_{1}}\bigg\}^{2}}\\ {} & \hspace{0.2778em}\hspace{0.2778em}\hspace{0.2778em}+2\frac{{J_{q}^{4}}}{{(q!)^{2}}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\big\{\mathbb{E}\big\{{V_{j}}{H_{q}}({W_{j}}{x_{1}}){V_{j}}{H_{q}}({W_{j}}{x_{2}})\big\}\big\}^{2}}d{x_{1}}d{x_{2}}.\end{aligned}\]
Now note that, in view of the normalization we adopted for the volume of ${\mathbb{S}^{d-1}}$,
\[\begin{aligned}{}& \frac{1}{{(q!)^{2}}}{\bigg\{{\int _{{\mathbb{S}^{d-1}}}}\mathbb{E}\big\{{V_{j}}{H_{q}}({W_{j}}{x_{1}}){V_{j}}{H_{q}}({W_{j}}{x_{1}})\big\}d{x_{1}}\bigg\}^{2}}=1,\\ {} & \frac{1}{{(q!)^{2}}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\big\{\mathbb{E}\big\{{V_{j}}{H_{q}}({W_{j}}{x_{1}}){V_{j}}{H_{q}}({W_{j}}{x_{2}})\big\}\big\}^{2}}d{x_{1}}d{x_{2}}\\ {} & \hspace{1em}={\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2q}}d{x_{1}}d{x_{2}}.\end{aligned}\]
Moreover,
\[\begin{aligned}{}& {\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\mathbb{E}\big\{{Z_{q}^{2}}({x_{1}}){Z_{q}^{2}}({x_{2}})\big\}d{x_{1}}d{x_{2}}\\ {} & \hspace{1em}={\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\mathbb{E}\big\{{Z_{q}^{2}}({x_{1}})\big\}\mathbb{E}\big\{{Z_{q}^{2}}({x_{2}})\big\}d{x_{1}}d{x_{2}}\\ {} & \hspace{1em}\hspace{1em}+2{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\mathbb{E}\big\{{Z_{q}}({x_{1}}){Z_{q}}({x_{2}})\big\}\mathbb{E}\big\{{Z_{q}}({x_{1}}){Z_{q}}({x_{2}})\big\}d{x_{1}}d{x_{2}}\\ {} & \hspace{1em}={J_{q}^{4}}+2{J_{q}^{4}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2q}}d{x_{1}}d{x_{2}}.\end{aligned}\]
Hence,
\[\begin{aligned}{}& \hspace{1em}\hspace{2.5pt}\mathbb{E}\| {F_{q}}{\| ^{4}}-\mathbb{E}\| {Z_{q}}{\| ^{4}}\\ {} & =\frac{1}{n}\frac{{J_{q}^{4}}}{{(q!)^{2}}}\\ {} & \times {\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}\operatorname{Cum}\big\{{V_{1}}{H_{q}}({W_{1}}{x_{1}}),{V_{1}}{H_{q}}({W_{1}}{x_{1}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}})\big\}d{x_{1}}d{x_{2}}.\end{aligned}\]
Using the diagram formula for Hermite polynomials [22, Proposition 4.15] and then isotropy, for ${q_{1}}+{q_{2}}+{q_{3}}+{q_{4}}=2q$ we have
\[\begin{aligned}{}& {\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\operatorname{Cum}\big\{{V_{1}}{H_{q}}({W_{1}}{x_{1}}),{V_{1}}{H_{q}}({W_{1}}{x_{1}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}})\big\}d{x_{1}}d{x_{2}}\\ {} & \hspace{1em}=\sum \limits_{{q_{1}}+{q_{2}}+{q_{3}}+{q_{4}}=2q}{\Upsilon _{{q_{1}}{q_{2}}{q_{3}}{q_{4}}}}\\ {} & \hspace{2em}\times {\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{1}}\rangle ^{{q_{1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{{q_{2}}}}{\langle {x_{2}},{x_{2}}\rangle ^{{q_{3}}}}{\langle {x_{2}},{x_{1}}\rangle ^{{q_{4}}}}d{x_{1}}d{x_{2}}\\ {} & \hspace{1em}={\sum \limits_{{q_{1}}=0}^{q-1}}{\Upsilon _{{q_{1}},q}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2(q-{q_{1}})}}d{x_{1}}d{x_{2}},\end{aligned}\]
where ${\Upsilon _{{q_{1}}{q_{2}}{q_{3}}{q_{4}}}}$, ${\Upsilon _{{q_{1}},q}}$ count the possible configurations of the diagrams. Precisely, ${\Upsilon _{{q_{1}},q}}$ is the number of connected diagrams with no flat edges between four rows of q nodes each and ${q_{1}}\lt q$ connections between first and second row. To compute this number explicitly, let us label the nodes of the diagram as
\[\begin{array}{c@{\hskip10.0pt}c@{\hskip10.0pt}c@{\hskip10.0pt}c@{\hskip10.0pt}c@{\hskip10.0pt}c@{\hskip10.0pt}c@{\hskip10.0pt}c@{\hskip10.0pt}c}{x_{1}}& {x_{1}}& {x_{1}}& \dots & & & & & {x_{1}}\\ {} {x^{\prime }_{1}}& {x^{\prime }_{1}}& {x^{\prime }_{1}}& \dots & & & & & {x^{\prime }_{1}}\\ {} {x_{2}}& {x_{2}}& {x_{2}}& \dots & & & & & {x_{2}}\\ {} {x^{\prime }_{2}}& {x^{\prime }_{2}}& {x^{\prime }_{2}}& \dots & & & & & {x^{\prime }_{2}}\end{array}.\]
Because there cannot be flat edges, the number of edges between ${x_{1}}$ and ${x^{\prime }_{1}}$ is the same as the number of edges between ${x_{2}}$ and ${x^{\prime }_{2}}$. Indeed, assume that the former were larger than the latter; then there would be fewer edges starting from the pair $({x_{1}},{x^{\prime }_{1}})$ and reaching the pair $({x_{2}},{x^{\prime }_{2}})$ than the other way round, which is absurd. There are $\left(\genfrac{}{}{0.0pt}{}{q}{{q_{1}}}\right)$ ways to choose the nodes of the first row connected with the second, $\left(\genfrac{}{}{0.0pt}{}{q}{{q_{1}}}\right)$ ways to choose the nodes of the second connected with the first, $\left(\genfrac{}{}{0.0pt}{}{q}{{q_{1}}}\right)$ ways to choose the nodes of the third connected with the fourth, and $\left(\genfrac{}{}{0.0pt}{}{q}{{q_{1}}}\right)$ ways to choose the nodes of the fourth connected with the third, which gives a term of cardinality ${\left(\genfrac{}{}{0.0pt}{}{q}{{q_{1}}}\right)^{4}}$; the number of ways to match the chosen nodes between the first and second rows and between the third and fourth rows is ${({q_{1}}!)^{2}}$. There are now $(2q-2{q_{1}})$ nodes left in the first two rows, which can be matched in an arbitrary way with the $(2q-2{q_{1}})$ remaining nodes of the third and the fourth row, giving $(2q-2{q_{1}})!$ possibilities; the result follows immediately.  □
Lemma 11.
The following bound holds true:
\[ \underset{0\le {q_{1}}\le q-1}{\max }{\Upsilon _{{q_{1}},q}}\lesssim \frac{{(q!)^{2}}{3^{2q}}}{q}.\]
Proof.
We can write
\[\begin{aligned}{}\frac{1}{{(q!)^{2}}}{\Upsilon _{{q_{1}},q}}& =\frac{1}{{(q!)^{2}}}{\left(\genfrac{}{}{0.0pt}{}{q}{{q_{1}}}\right)^{4}}{({q_{1}}!)^{2}}(2q-2{q_{1}})!\\ {} & =\frac{{(q!)^{2}}(2q-2{q_{1}})!}{{({q_{1}}!)^{2}}{((q-{q_{1}})!)^{2}}{((q-{q_{1}})!)^{2}}}={\left(\genfrac{}{}{0.0pt}{}{q}{{q_{1}}}\right)^{2}}\left(\genfrac{}{}{0.0pt}{}{2q-2{q_{1}}}{q-{q_{1}}}\right).\end{aligned}\]
Note that both factors in the last expression are decreasing in ${q_{1}}$ when ${q_{1}}\gt \frac{q}{2}$ (say). Fix $\alpha \in [\varepsilon ,\frac{1}{2}+\varepsilon ]$, $\varepsilon \gt 0$; repeated use of Stirling’s approximation gives
\[\begin{aligned}{}& \hspace{1em}\hspace{2.5pt}\frac{{(q!)^{2}}(2q-2{q_{1}})!}{{({q_{1}}!)^{2}}{((q-{q_{1}})!)^{2}}{((q-{q_{1}})!)^{2}}}\\ {} & \sim \frac{1}{{(2\pi )^{3/2}}}\frac{{q^{2q+1}}{(2q-2{q_{1}})^{2q-2{q_{1}}+\frac{1}{2}}}{e^{2{q_{1}}}}{e^{2(q-{q_{1}})}}{e^{2q-2{q_{1}}}}}{{e^{2q}}{e^{2q-2{q_{1}}}}{q_{1}^{2{q_{1}}+1}}{(q-{q_{1}})^{4q-4{q_{1}}+2}}}\\ {} & \sim \frac{{2^{2q-2{q_{1}}+\frac{1}{2}}}}{{(2\pi )^{3/2}}}\frac{{q^{2q+1}}}{{q_{1}^{2{q_{1}}+1}}{(q-{q_{1}})^{2q-2{q_{1}}+\frac{3}{2}}}}\end{aligned}\]
Taking ${q_{1}}=\alpha q$ we obtain
\[\begin{aligned}{}& \hspace{1em}\hspace{2.5pt}\frac{{2^{2(1-\alpha )q+\frac{1}{2}}}}{{(2\pi )^{3/2}}}\frac{{q^{2q+1}}}{{(\alpha q)^{2\alpha q+1}}{((1-\alpha )q)^{2(1-\alpha )q+\frac{3}{2}}}}\\ {} & =\frac{{2^{2(1-\alpha )q+\frac{1}{2}}}}{{(2\pi )^{3/2}}}\frac{1}{{(\alpha )^{2\alpha q+1}}{((1-\alpha ))^{2(1-\alpha )q+\frac{3}{2}}}{q^{\frac{3}{2}}}}\\ {} & =\frac{{2^{\frac{1}{2}}}}{{(2\pi )^{3/2}}{q^{\frac{3}{2}}}}{\bigg(\frac{{2^{1-\alpha }}}{{\alpha ^{\alpha +\frac{1}{2q}}}{(1-\alpha )^{1-\alpha +\frac{3}{4q}}}}\bigg)^{2q}}.\end{aligned}\]
It can be immediately checked that the function $f(\alpha ):=\frac{{2^{1-\alpha }}}{{\alpha ^{\alpha }}{(1-\alpha )^{1-\alpha }}}$ admits a unique maximum at $\alpha =\frac{1}{3}$, for which the quantity gets bounded by ${q^{-\frac{3}{2}}}{3^{2q}}$ up to constants. On the other hand, for ${q_{1}}\lt \lfloor \varepsilon q\rfloor $ it suffices to notice that
\[\begin{aligned}{}{\left(\genfrac{}{}{0.0pt}{}{q}{{q_{1}}}\right)^{2}}\left(\genfrac{}{}{0.0pt}{}{2q-2{q_{1}}}{q-{q_{1}}}\right)& \le {2^{2q}}{\left(\genfrac{}{}{0.0pt}{}{q}{{q_{1}}}\right)^{2}}\le {2^{2q}}{\left(\genfrac{}{}{0.0pt}{}{q}{\lfloor \varepsilon q\rfloor }\right)^{2}}\\ {} & \le {2^{2q}}\frac{{q^{2q+1}}}{2\pi {(\varepsilon q)^{2\varepsilon q+1}}{((1-\varepsilon )q)^{2(1-\varepsilon )q+1}}},\end{aligned}\]
where we used the fact that $g(\varepsilon )={\varepsilon ^{-\varepsilon }}{(1-\varepsilon )^{-(1-\varepsilon )}}$ is strictly increasing in $(0,\frac{1}{2})$; hence we get
\[ {2^{2q}}\frac{{q^{2q+1}}}{2\pi {(\varepsilon q)^{2\varepsilon q+1}}{((1-\varepsilon )q)^{2(1-\varepsilon )q+1}}}=\frac{{2^{2q}}}{2\pi q}\frac{1}{{({(\varepsilon )^{\varepsilon +\frac{1}{2q}}}{((1-\varepsilon ))^{(1-\varepsilon )+\frac{1}{2q}}})^{2q}}}.\]
The result is proved by choosing ε such that
\[ {\big({(\varepsilon )^{\varepsilon +\frac{1}{2q}}}{\big((1-\varepsilon )\big)^{(1-\varepsilon )+\frac{1}{2q}}}\big)^{-1}}\lt \frac{3}{2}.\]
 □
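The statement of Lemma 11 can also be probed numerically with exact integer arithmetic. The sketch below is not part of the proof; it only checks, for small $q$, the factorisation of ${\Upsilon _{{q_{1}},q}}/{(q!)^{2}}$ used above, the location of the maximiser near $q/3$, and the boundedness of the rescaled maximum. The function names are illustrative choices.

```python
from math import comb, factorial

def upsilon(q, q1):
    """Upsilon_{q1,q} = C(q,q1)^4 (q1!)^2 (2q-2q1)!, as in Lemma 10."""
    return comb(q, q1) ** 4 * factorial(q1) ** 2 * factorial(2 * q - 2 * q1)

def reduced(q, q1):
    """Upsilon_{q1,q} / (q!)^2 written as C(q,q1)^2 C(2q-2q1, q-q1)."""
    return comb(q, q1) ** 2 * comb(2 * (q - q1), q - q1)

def argmax_q1(q):
    """Index q1 in {0, ..., q-1} maximising the reduced expression."""
    return max(range(q), key=lambda q1: reduced(q, q1))

def scaled_max(q):
    """max_{q1} Upsilon_{q1,q} divided by (q!)^2 3^{2q} / q."""
    return reduced(q, argmax_q1(q)) * q / 9 ** q

if __name__ == "__main__":
    for q in (3, 6, 12, 24, 48, 96):
        print(q, argmax_q1(q), scaled_max(q))
```

The maximiser sits close to $q/3$, in agreement with the fact that $f(\alpha )$ is maximal at $\alpha =\frac{1}{3}$, where $f(\frac{1}{3})=3$.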
We recall the standard definition of the Beta function $B(\alpha ,\beta )$:
\[ B(\alpha ,\beta )=\frac{\Gamma (\alpha )\Gamma (\beta )}{\Gamma (\alpha +\beta )},\hspace{1em}\Gamma (\alpha )={\int _{0}^{\infty }}{t^{\alpha -1}}\exp (-t)dt,\hspace{1em}\alpha ,\beta \gt 0.\]
Lemma 12.
We have
\[ {\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2(q-{q_{1}})}}d{x_{1}}d{x_{2}}=\frac{{s_{d-1}}}{{s_{d}}}B\bigg(q-{q_{1}}+\frac{1}{2},\frac{d}{2}-\frac{1}{2}\bigg)\le 1.\]
Proof.
Fixing a pole and switching to spherical coordinates, we get
\[\begin{aligned}{}& \hspace{1em}\hspace{2.5pt}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2q-2{q_{1}}}}d{x_{1}}d{x_{2}}\\ {} & =\frac{{s_{d-1}}}{{s_{d}}}{\int _{0}^{\pi }}{(\cos \theta )^{2q-2{q_{1}}}}{(\sin \theta )^{d-2}}d\theta \\ {} & =\frac{{s_{d-1}}}{{s_{d}}}{\int _{0}^{\pi /2}}{\big({\cos ^{2}}\theta \big)^{q-{q_{1}}-\frac{1}{2}}}{\big(1-{\cos ^{2}}\theta \big)^{\frac{d-3}{2}}}d\cos \theta \\ {} & =\frac{{s_{d-1}}}{{s_{d}}}{\int _{0}^{1}}{t^{q-{q_{1}}-\frac{1}{2}}}{(1-t)^{\frac{d-3}{2}}}dt=\frac{{s_{d-1}}}{{s_{d}}}B\bigg(q-{q_{1}}+\frac{1}{2},\frac{d-1}{2}\bigg),\end{aligned}\]
which is smaller than 1 for all d, q.  □
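As a sanity check of Lemma 12, the integral can be evaluated numerically. The sketch below assumes that the sphere carries unit total mass, so that the prefactor ${s_{d-1}}/{s_{d}}$ equals $1/B(\frac{1}{2},\frac{d-1}{2})$; this identification, as well as the function names, is an assumption of the example. The Beta expression is compared with the equivalent product formula ${\textstyle\prod _{i=0}^{k-1}}\frac{2i+1}{d+2i}$ for $k=q-{q_{1}}$.

```python
from math import exp, lgamma

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def sphere_moment(k, d):
    """Integral of <x1, x2>^(2k) over the product of two unit-mass spheres S^(d-1)."""
    return exp(log_beta(k + 0.5, (d - 1) / 2) - log_beta(0.5, (d - 1) / 2))

def moment_product(k, d):
    """The same quantity as the product of (2i+1)/(d+2i) for i = 0, ..., k-1."""
    out = 1.0
    for i in range(k):
        out *= (2 * i + 1) / (d + 2 * i)
    return out

if __name__ == "__main__":
    for d in (2, 3, 10, 100):
        print(d, [sphere_moment(k, d) for k in (1, 2, 5)])
```

Every factor in the product is at most 1 for $d\ge 2$, which gives another way to see the bound, and the factors shrink as $d$ grows, in line with Remark 13.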
Remark 13.
The bound we obtain is actually uniform over d. It is likely that it could be further improved for growing values of d, because the Beta function decreases quickly as d diverges.
Lemma 14.
We have
\[ \mathbb{E}\| {F_{p}}{\| ^{4}}\le \mathbb{E}\| {F_{p}}{\| ^{4}}-\mathbb{E}\| {Z_{p}}{\| ^{4}}+3{J_{p}^{4}}.\]
Proof.
It suffices to observe that, following the calculations of Lemma 10,
\[ \mathbb{E}\| {Z_{q}}{\| ^{4}}={J_{q}^{4}}+2{J_{q}^{4}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2q}}d{x_{1}}d{x_{2}}\le 3{J_{q}^{4}}.\]
 □

4.3 Bounding $C({F_{\le Q}})$

The following result reduces the problem of bounding $C({F_{\le Q}})$ to that of bounding $M({F_{\le Q}})$.
Proposition 15.
We have
\[ C({F_{\le Q}})\le M({F_{\le Q}}).\]
Proof.
We shall show that
\[\begin{aligned}{}C({F_{\le Q}})& ={\sum \limits_{p,q:p\ne q}^{Q}}{c_{p,q}}{\sum \limits_{{p_{1}}=p-q}^{p-1}}{\left(\genfrac{}{}{0.0pt}{}{p}{{p_{1}}}\right)^{2}}{({p_{1}}!)^{2}}{\left(\genfrac{}{}{0.0pt}{}{q}{q-p+{p_{1}}}\right)^{2}}\\ {} & \hspace{1em}\times {\big((q-p+{p_{1}})!\big)^{2}}\big(2(p-{p_{1}})\big)!\\ {} & \hspace{1em}\times {\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2(p-{p_{1}})}}d{x_{1}}d{x_{2}}\\ {} & \le {\sum \limits_{p,q}^{Q}}{\sum \limits_{{p_{1}}=1}^{p}}{c_{p,q}}{\left(\genfrac{}{}{0.0pt}{}{p}{{p_{1}}}\right)^{4}}{({p_{1}}!)^{2}}(2p-2{p_{1}})!{\int _{{S^{d-1}}\times {S^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2(p-{p_{1}})}}d{x_{1}}d{x_{2}}\\ {} & =M({F_{\le Q}}).\end{aligned}\]
Recall that
\[\begin{aligned}{}& \operatorname{Cov}\big(\| {F_{p}}{\| ^{2}},\| {F_{q}}{\| ^{2}}\big)\\ {} & =\frac{{J_{p}^{2}}{J_{q}^{2}}}{n}\hspace{-0.1667em}{\int _{{\mathbb{S}^{d-1}}}}\hspace{-0.1667em}{\int _{{\mathbb{S}^{d-1}}}}\hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}\operatorname{Cum}\big\{V{H_{p}}(W{x_{1}}),\hspace{-0.1667em}V{H_{p}}(W{x_{1}}),\hspace{-0.1667em}V{H_{q}}(W{x_{2}}),\hspace{-0.1667em}V{H_{q}}(W{x_{2}})\big\}d{x_{1}}d{x_{2}}.\end{aligned}\]
Indeed,
\[ \| {F_{p}}{\| ^{2}}=\frac{{J_{p}^{2}}}{n}\frac{1}{p!}{\sum \limits_{{j_{1}},{j_{2}}=1}^{n}}{\int _{{S^{d-1}}}}\big\{{V_{{j_{1}}}}{H_{p}}({W_{{j_{1}}}}x)\big\}\big\{{V_{{j_{2}}}}{H_{p}}({W_{{j_{2}}}}x)\big\}dx,\]
and
\[ \operatorname{Cov}\big(\| {F_{p}}{\| ^{2}},\| {F_{q}}{\| ^{2}}\big)=\mathbb{E}\big(\| {F_{p}}{\| ^{2}}\| {F_{q}}{\| ^{2}}\big)-\mathbb{E}\big(\| {F_{p}}{\| ^{2}}\big)\mathbb{E}\big(\| {F_{q}}{\| ^{2}}\big),\]
where
\[\begin{aligned}{}& \mathbb{E}\big(\| {F_{p}}{\| ^{2}}\| {F_{q}}{\| ^{2}}\big)=\frac{{J_{p}^{2}}{J_{q}^{2}}}{{n^{2}}}\frac{1}{p!\hspace{0.1667em}q!}{\sum \limits_{{j_{1}},{j_{2}}=1}^{n}}{\sum \limits_{{j_{3}},{j_{4}}=1}^{n}}{\int _{{\mathbb{S}^{d-1}}}}{\int _{{\mathbb{S}^{d-1}}}}\\ {} & \mathbb{E}\big[{V_{{j_{1}}}}{H_{p}}({W_{{j_{1}}}}{x_{1}}){V_{{j_{2}}}}{H_{p}}({W_{{j_{2}}}}{x_{1}}){V_{{j_{3}}}}{H_{q}}({W_{{j_{3}}}}{x_{2}}){V_{{j_{4}}}}{H_{q}}({W_{{j_{4}}}}{x_{2}})\big]d{x_{1}}d{x_{2}}\\ {} =\hspace{2.25pt}& \frac{{J_{p}^{2}}{J_{q}^{2}}}{n}\hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\operatorname{Cum}\big[{V_{1}}{H_{p}}({W_{1}}{x_{1}}),{V_{1}}{H_{p}}({W_{1}}{x_{1}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}})\big]d{x_{1}}d{x_{2}}\\ {} +\hspace{2.25pt}& {J_{p}^{2}}{J_{q}^{2}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\mathbb{E}\big[{V_{1}}{H_{p}}({W_{1}}{x_{1}}){V_{1}}{H_{p}}({W_{1}}{x_{1}})\big]\mathbb{E}\big[{V_{1}}{H_{q}}({W_{1}}{x_{2}}){V_{1}}{H_{q}}({W_{1}}{x_{2}})\big]d{x_{1}}d{x_{2}}\\ {} +\hspace{2.25pt}& 2{J_{p}^{2}}{J_{q}^{2}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\mathbb{E}\big[{V_{1}}{H_{p}}({W_{1}}{x_{1}}){V_{1}}{H_{q}}({W_{1}}{x_{2}})\big]\mathbb{E}\big[{V_{1}}{H_{p}}({W_{1}}{x_{1}}){V_{1}}{H_{q}}({W_{1}}{x_{2}})\big]d{x_{1}}d{x_{2}}.\end{aligned}\]
By the orthogonality of the Hermite polynomials, the third term vanishes and we are left with
\[\begin{aligned}{}& \hspace{2.22499pt}\frac{{J_{p}^{2}}{J_{q}^{2}}}{n}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\operatorname{Cum}\big[{V_{1}}{H_{p}}({W_{1}}{x_{1}}),{V_{1}}{H_{p}}({W_{1}}{x_{1}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}})\big]d{x_{1}}d{x_{2}}\\ {} & +{J_{p}^{2}}{J_{q}^{2}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\mathbb{E}\big[{V_{1}}{H_{p}}({W_{1}}{x_{1}}){V_{1}}{H_{p}}({W_{1}}{x_{1}})\big]\mathbb{E}\big[{V_{1}}{H_{q}}({W_{1}}{x_{2}}){V_{1}}{H_{q}}({W_{1}}{x_{2}})\big]d{x_{1}}d{x_{2}}\\ {} & =\frac{{J_{p}^{2}}{J_{q}^{2}}}{n}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\operatorname{Cum}\big[{V_{1}}{H_{p}}({W_{1}}{x_{1}}),{V_{1}}{H_{p}}({W_{1}}{x_{1}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}})\big]d{x_{1}}d{x_{2}}\\ {} & +\bigg({J_{p}^{2}}{\int _{{\mathbb{S}^{d-1}}}}\mathbb{E}\big[{V_{1}}{H_{p}}({W_{1}}{x_{1}}){V_{1}}{H_{p}}({W_{1}}{x_{1}})\big]d{x_{1}}\bigg)\bigg({J_{q}^{2}}{\int _{{\mathbb{S}^{d-1}}}}\mathbb{E}\big[{V_{1}}{H_{q}}({W_{1}}{x_{2}}){V_{1}}{H_{q}}({W_{1}}{x_{2}})\big]d{x_{2}}\bigg)\\ {} & =\frac{{J_{p}^{2}}{J_{q}^{2}}}{n}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\operatorname{Cum}\big[{V_{1}}{H_{p}}({W_{1}}{x_{1}}),{V_{1}}{H_{p}}({W_{1}}{x_{1}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}}),{V_{1}}{H_{q}}({W_{1}}{x_{2}})\big]d{x_{1}}d{x_{2}}\\ {} & +\mathbb{E}\big(\| {F_{p}}{\| ^{2}}\big)\mathbb{E}\big(\| {F_{q}}{\| ^{2}}\big).\end{aligned}\]
Indeed,
\[\begin{aligned}{}\mathbb{E}\big(\| {F_{p}}{\| ^{2}}\big)& ={J_{p}^{2}}\mathbb{E}\Bigg[{\sum \limits_{{j_{1}},{j_{2}}=1}^{n}}{\int _{{\mathbb{S}^{d-1}}}}\big\{{V_{{j_{1}}}}{H_{p}}({W_{{j_{1}}}}x)\big\}\big\{{V_{{j_{2}}}}{H_{p}}({W_{{j_{2}}}}x)\big\}dx\Bigg]\\ {} & ={J_{p}^{2}}\mathbb{E}\bigg[n{\int _{{\mathbb{S}^{d-1}}}}\big\{{V_{1}}{H_{p}}({W_{1}}x)\big\}\big\{{V_{1}}{H_{p}}({W_{1}}x)\big\}dx\bigg].\end{aligned}\]
Now note that
\[\begin{aligned}{}& \frac{{J_{p}^{2}}{J_{q}^{2}}}{n}\hspace{-0.1667em}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}\hspace{-0.1667em}\hspace{-0.1667em}\hspace{-0.1667em}\operatorname{Cum}\big[\hspace{-0.1667em}{V_{1}}{H_{p}}({W_{1}}{x_{1}}),\hspace{-0.1667em}{V_{1}}{H_{p}}({W_{1}}{x_{1}}),\hspace{-0.1667em}{V_{1}}{H_{q}}({W_{1}}{x_{2}}),\hspace{-0.1667em}{V_{1}}{H_{q}}({W_{1}}{x_{2}})\hspace{-0.1667em}\big]d{x_{1}}d{x_{2}}\\ {} & =\frac{{J_{p}^{2}}{J_{q}^{2}}}{n}{\sum \limits_{{p_{1}}=p-q}^{p-1}}{\left(\genfrac{}{}{0.0pt}{}{p}{{p_{1}}}\right)^{2}}{p_{1}}!{\left(\genfrac{}{}{0.0pt}{}{q}{q-p+{p_{1}}}\right)^{2}}(q-p+{p_{1}})!\big(2(p-{p_{1}})\big)!\\ {} & \hspace{1em}\times {\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2(p-{p_{1}})}}d{x_{1}}d{x_{2}}.\end{aligned}\]
Moreover,
\[ {\left(\genfrac{}{}{0.0pt}{}{q}{q-p+{p_{1}}}\right)^{2}}{\big((q-p+{p_{1}})!\big)^{2}}=\frac{{(q!)^{2}}}{{((p-{p_{1}})!)^{2}}}\le \frac{{(p!)^{2}}}{{((p-{p_{1}})!)^{2}}}={\left(\genfrac{}{}{0.0pt}{}{p}{{p_{1}}}\right)^{2}}{({p_{1}}!)^{2}},\]
and hence
\[\begin{aligned}{}& {\sum \limits_{{p_{1}}=p-q}^{p-1}}{\left(\genfrac{}{}{0.0pt}{}{p}{{p_{1}}}\right)^{2}}{({p_{1}}!)^{2}}{\left(\genfrac{}{}{0.0pt}{}{q}{q-p+{p_{1}}}\right)^{2}}{\big((q-p+{p_{1}})!\big)^{2}}\big(2(p-{p_{1}})\big)!\\ {} & \times {\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2(p-{p_{1}})}}d{x_{1}}d{x_{2}}\\ {} \le \hspace{2.5pt}& {\sum \limits_{{p_{1}}=1}^{p}}{\left(\genfrac{}{}{0.0pt}{}{p}{{p_{1}}}\right)^{4}}{({p_{1}}!)^{2}}(2p-2{p_{1}})!{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2(p-{p_{1}})}}d{x_{1}}d{x_{2}},\end{aligned}\]
so that our previous bound on the fourth cumulant is sufficient, up to a factor $\frac{{J_{q}^{2}}}{{J_{p}^{2}}}\frac{q!}{p!}\ll 1$.  □
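The elementary identity and inequality used in the last step can be verified with exact integer arithmetic. The sketch below assumes $q\le p$ and $p-q\le {p_{1}}\le p-1$, which is the range appearing in the sums above; the function names are illustrative.

```python
from math import comb, factorial

def left_side(p, q, p1):
    """C(q, q-p+p1)^2 ((q-p+p1)!)^2."""
    k = q - p + p1
    return comb(q, k) ** 2 * factorial(k) ** 2

def middle(p, q, p1):
    """(q!)^2 / ((p-p1)!)^2, an integer because p - p1 <= q."""
    return (factorial(q) // factorial(p - p1)) ** 2

def right_side(p, p1):
    """C(p, p1)^2 (p1!)^2 = (p!)^2 / ((p-p1)!)^2."""
    return comb(p, p1) ** 2 * factorial(p1) ** 2

def check(max_p=12):
    """Return True if identity and inequality hold for all q < p <= max_p."""
    for p in range(1, max_p + 1):
        for q in range(1, p):
            for p1 in range(p - q, p):
                if left_side(p, q, p1) != middle(p, q, p1):
                    return False
                if middle(p, q, p1) > right_side(p, p1):
                    return False
    return True

if __name__ == "__main__":
    print(check())
```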

4.4 Proof of Theorem 4

The proof of Theorem 4 takes advantage of the tighter bounds which are obtained in [8, Section 4]; we refer to this paper and Section A.1, together with the monograph [25], for more details on the notation and further discussion.
Consider the isonormal Gaussian process with the underlying Hilbert space
\[ \mathcal{H}:={L^{2}}[0,2\pi ]\otimes {L^{2}}[0,2\pi ]\otimes {\mathbb{R}^{d}};\]
we take
\[\begin{aligned}{}{V_{j}}& =I({f_{{V_{j}}}})=I\bigg(\frac{\cos (\cdot )}{\sqrt{\pi }}\otimes \frac{\exp (ij\cdot )}{\sqrt{2\pi }}\otimes z\bigg)\hspace{2.5pt}\text{for some fixed}\hspace{2.5pt}z\hspace{2.5pt}\text{such that}\hspace{2.5pt}\| z{\| _{{\mathbb{R}^{d}}}}=1,\\ {} {W_{j}}x& =I({f_{{W_{j}}x}})=I\bigg(\frac{\sin (\cdot )}{\sqrt{\pi }}\otimes \frac{\exp (ij\cdot )}{\sqrt{2\pi }}\otimes x\bigg)\hspace{2.5pt}\text{for any}\hspace{2.5pt}x\in {\mathbb{S}^{d-1}}.\end{aligned}\]
It is readily seen that these are two Gaussian, zero mean, unit variance random variables, with covariances
\[ \mathbb{E}[{V_{j}}{V_{{j^{\prime }}}}]={\delta _{j}^{{j^{\prime }}}},\hspace{2.5pt}\mathbb{E}[{V_{j}}{W_{j}}x]=0,\hspace{2.5pt}\mathbb{E}[{W_{j}}{x_{1}}{W_{{j^{\prime }}}}{x_{2}}]={\delta _{j}^{{j^{\prime }}}}{\langle {x_{1}},{x_{2}}\rangle _{{\mathbb{R}^{d}}}}.\]
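These covariances reduce to orthonormality relations in ${L^{2}}[0,2\pi ]$, which can be confirmed numerically. The following sketch uses the periodic trapezoid rule, which is exact up to rounding for trigonometric polynomials of low degree; treating the inner product as Hermitian for the complex exponentials is an assumption of the example, and the helper names are illustrative.

```python
import cmath
import math

def inner(f, g, m=4096):
    """Hermitian L^2[0, 2*pi] inner product via the periodic trapezoid rule."""
    h = 2 * math.pi / m
    return h * sum(f(k * h) * complex(g(k * h)).conjugate() for k in range(m))

def f_cos(t):
    """First tensor factor of f_{V_j}."""
    return math.cos(t) / math.sqrt(math.pi)

def f_sin(t):
    """First tensor factor of f_{W_j x}."""
    return math.sin(t) / math.sqrt(math.pi)

def e(j):
    """Second tensor factor exp(i j t) / sqrt(2 pi)."""
    return lambda t: cmath.exp(1j * j * t) / math.sqrt(2 * math.pi)

def cov_W(j, jp, x1, x2):
    """Inner product of f_{W_j x1} and f_{W_jp x2} as a product over tensor factors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    return inner(f_sin, f_sin) * inner(e(j), e(jp)) * dot

if __name__ == "__main__":
    print(abs(inner(f_cos, f_cos)), abs(inner(f_cos, f_sin)))
    print(abs(cov_W(1, 1, (1.0, 0.0), (0.6, 0.8))), abs(cov_W(1, 2, (1.0, 0.0), (0.6, 0.8))))
```

The vanishing inner product between the cosine and sine factors is what makes ${V_{j}}$ and ${W_{j}}x$ uncorrelated, while the exponential factors separate different indices $j$.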
Also, we have that
\[\begin{aligned}{}\frac{1}{\sqrt{n}}{\sum \limits_{j=1}^{n}}{V_{j}}\sigma ({W_{j}}x)& =\frac{1}{\sqrt{n}}{\sum \limits_{j=1}^{n}}{\sum \limits_{p=1}^{\infty }}\frac{{J_{p}}(\sigma )}{\sqrt{p!}}I({f_{{V_{j}}}}){H_{p}}({W_{j}}x)\\ {} & =\frac{1}{\sqrt{n}}{\sum \limits_{j=1}^{n}}{\sum \limits_{p=1}^{\infty }}\frac{{J_{p}}(\sigma )}{\sqrt{p!}}I({f_{{V_{j}}}}){I_{p}}\big({f_{{W_{j}}x}^{\otimes p}}\big),\end{aligned}\]
where we have used the standard identity linking Hermite polynomials and multiple stochastic integrals (i.e., Theorem 2.7.7 in [25]). To evaluate the term $I({f_{{V_{j}}}}){I_{p}}({f_{{W_{j}}x}^{\otimes p}})$, we recall the product formula [25, Theorem 2.7.10]
\[ {I_{p}}\big({f^{\otimes p}}\big){I_{q}}\big({g^{\otimes q}}\big)={\sum \limits_{r=0}^{p\wedge q}}r!\left(\genfrac{}{}{0.0pt}{}{p}{r}\right)\left(\genfrac{}{}{0.0pt}{}{q}{r}\right){I_{p+q-2r}}\big({f^{\otimes p}}{\widetilde{\otimes }_{r}}{g^{\otimes q}}\big);\]
in our case $p=1$, ${f_{{V_{j}}}}{\widetilde{\otimes }_{1}}{f_{{W_{j}}x}^{\otimes p}}=0$, hence we obtain
\[ \frac{1}{\sqrt{n}}{\sum \limits_{j=1}^{n}}I({f_{{V_{j}}}}){I_{p}}\big({f_{{W_{j}}x}^{\otimes p}}\big)={I_{p+1}}\Bigg(\frac{1}{\sqrt{n}}{\sum \limits_{j=1}^{n}}{f_{{V_{j}}}}{\widetilde{\otimes }_{r}}{f_{{W_{j}}x}^{\otimes p}}\Bigg),\]
where $\widetilde{\otimes }$ denotes the symmetrized tensor product. Let us now write
\[ {f_{p+1;x}}:=\frac{1}{\sqrt{n}}\frac{{J_{p}}(\sigma )}{\sqrt{p!}}{\sum \limits_{j=1}^{n}}{f_{{V_{j}}}}{\widetilde{\otimes }_{r}}{f_{{W_{j}}x}^{\otimes p}};\]
it can then be readily checked that, for $K={L^{2}}({\mathbb{S}^{d-1}})$ and $r\lt ({p_{1}}+1\wedge {p_{2}}+1)$,
\[\begin{array}{l}\displaystyle \| {f_{{p_{1}}+1;{x_{1}}}}\otimes {f_{{p_{2}}+1;{x_{2}}}}{\| _{{\mathcal{H}^{\otimes ({p_{1}}+{p_{2}}-2r)}}}^{2}}=\frac{1}{n}\frac{{J_{p}^{4}}(\sigma )}{{(p!)^{2}}}{\langle {x_{1}},{x_{2}}\rangle ^{2r}},\\ {} \displaystyle \| {f_{{p_{1}}+1;{x_{1}}}}\otimes {f_{{p_{2}}+1;{x_{2}}}}{\| _{{\mathcal{H}^{\otimes ({p_{1}}+{p_{2}}-2r)}}\otimes {K^{\otimes 2}}}^{2}}=\frac{1}{n}\frac{{J_{p}^{4}}(\sigma )}{{(p!)^{2}}}{\int _{{\mathbb{S}^{d-1}}\times {\mathbb{S}^{d-1}}}}{\langle {x_{1}},{x_{2}}\rangle ^{2r}}d{x_{1}}d{x_{2}}.\end{array}\]
To complete the proof, it is then sufficient to exploit [8, Theorem 4.3] and to follow similar steps as in the proof of Theorem 1.

A Appendix

A.1 The quantitative functional central limit theorem by Bourguin and Campese (2020)

In this paper, the distance between the random fields we consider is measured in the so-called ${d_{2}}$-metric, which is given by
\[ {d_{2}}(F,G)=\underset{\substack{h\in {C_{b}^{2}}(K)\\ {} \| h{\| _{{C_{b}^{2}}(K)}}\le 1}}{\sup }\big|\mathbb{E}\big[h(F)\big]-\mathbb{E}\big[h(G)\big]\big|;\]
here, ${C_{b}^{2}}(K)$ denotes the space of continuous and bounded maps from the Hilbert space K into $\mathbb{R}$ endowed with two bounded Fréchet derivatives ${h^{\prime }}$, ${h^{\prime\prime }}$; that is, for each $h\in {C_{b}^{2}}(K)$ there exists a bounded linear operator ${h^{\prime }}:K\to \mathbb{R}$ such that $\| {h^{\prime }}{\| _{K}}\le 1$ and
\[ \underset{\| v\| \to 0}{\lim }\frac{|h(x+v)-h(x)-{h^{\prime }}(v)|}{\| v\| }=0,\]
and similarly for the second derivative.
We will use a simplified version of the results by Bourguin and Campese in [8], which we report below.
Theorem 16 (A special case of Theorem 3.10 in [8]).
Let ${F_{\le Q}}\in {L^{2}}(\Omega ,K)$ be a Hilbert-valued random element ${F_{\le Q}}:\Omega \to K$ with zero mean, covariance operator ${S_{\le Q}}$ and such that it can be decomposed into a finite number of Wiener chaoses:
\[ {F_{\le Q}}(.)={\sum \limits_{p=0}^{Q}}{F_{p}}(.).\]
Then, for Z a Gaussian process on the same structure with covariance operator S we have that
\[ {d_{2}}({F_{\le Q}},Z)\le \frac{1}{2}\sqrt{M({F_{\le Q}})+C({F_{\le Q}})}+\| S-{S_{\le Q}}{\| _{{L^{2}}(\Omega ,\mathrm{HS})}},\]
where
\[\begin{aligned}{}M({F_{\le Q}})& =\frac{1}{\sqrt{3}}\sum \limits_{p,q}{c_{p,q}}\sqrt{\mathbb{E}\| {F_{p}}{\| ^{4}}\big(\mathbb{E}\| {F_{q}}{\| ^{4}}-\mathbb{E}\| {Z_{q}}{\| ^{4}}\big)},\\ {} C({F_{\le Q}})& =\sum \limits_{\substack{p,q\\ {} p\ne q}}{c_{p,q}}\operatorname{Cov}\big(\| {F_{p}}{\| ^{2}},\| {F_{q}}{\| ^{2}}\big),\end{aligned}\]
${Z_{q}}$ is a centred Gaussian process with the same covariance operator as ${F_{q}}$, i.e., $\mathbb{E}[{Z_{q}}({x_{1}}){Z_{q}}({x_{2}})]={J_{q}^{2}}(\sigma ){\langle {x_{1}},{x_{2}}\rangle ^{q}}$, and
\[ {c_{p,q}}=\left\{\begin{array}{l@{\hskip10.0pt}l}1+\sqrt{3},\hspace{1em}& p=q,\\ {} \frac{p+q}{2p},\hspace{1em}& p\ne q.\end{array}\right.\]
Remark 17.
The general version of Theorem 3.10 in [8] covers a broader class of processes which can be expanded into the eigenfunctions of Markov operators. We do not need this extra generality, and we refer to [8] for more discussion and details.
We will now review another result from [8], which holds under tighter smoothness conditions. We shall omit a number of details, for which we refer to classical references such as [25].
Given a Hilbert space $\mathcal{H}$, we recall that the isonormal process is the collection of zero mean Gaussian random variables with covariance function
\[ \mathbb{E}\big[X({h_{1}})X({h_{2}})\big]={\langle {h_{1}},{h_{2}}\rangle _{\mathcal{H}}}.\]
In our case these random variables take values in the separable Hilbert space ${L^{2}}(\Omega ,{\mathbb{S}^{d-1}})$. For smooth functions $F:\Omega \to {L^{2}}(\Omega ,{\mathbb{S}^{d-1}})$ of the form
\[ F=f\big(W({h_{1}}),\dots ,W({h_{p}})\big)\otimes v,\hspace{2.5pt}f\in {C_{b}^{\infty }}\big({\mathbb{R}^{p}}\big),\hspace{2.5pt}v\in {L^{2}}\big(\Omega ,{\mathbb{S}^{d-1}}\big),\]
we recall that the Malliavin derivative is defined as
\[ DF={\sum \limits_{i=1}^{p}}{\partial _{i}}f\big(W({h_{1}}),\dots ,W({h_{p}})\big){h_{i}}\otimes v\]
whose domain, denoted by ${\mathbb{D}^{1,2}}$, is the closure of the space of smooth functions with respect to the Sobolev norm $\| F{\| _{{L^{2}}(\Omega ,{\mathbb{S}^{d-1}})}^{2}}+\| DF{\| _{{L^{2}}(\Omega ,\mathcal{H}\otimes {\mathbb{S}^{d-1}})}^{2}}$; ${\mathbb{D}^{1,4}}$ is defined analogously.
In this setting, the Wiener chaos decompositions take the form
\[ F={\sum \limits_{p=1}^{\infty }}{I_{p}}({f_{p}}),\hspace{0.2778em}{f_{p}}\in {\mathcal{H}^{\odot p}}\otimes {L^{2}}\big({\mathbb{S}^{d-1}}\big),\]
where ${\mathcal{H}^{\odot p}}$ denotes the p-fold symmetrized tensor product of $\mathcal{H}$, see [8, Subsection 4.1.2]. The main result we are going to exploit is their Theorem 4.3, which we can recall as follows.
Theorem 18 (A special case of Theorem 4.3 in [8]).
Let Z be a centred random element of ${L^{2}}({\mathbb{S}^{d-1}})$ with covariance operator S and $F\in {\mathbb{D}^{1,4}}$ with covariance operator T and chaos decomposition $F={\textstyle\sum _{p}}{I_{p}}({f_{p}})$, where ${f_{p}}\in {\mathcal{H}^{\odot p}}\otimes {L^{2}}({\mathbb{S}^{d-1}})$. Then
\[ {d_{2}}(F,Z)\le \frac{1}{2}\big(\widetilde{M}(F)+\widetilde{C}(F)+\| S-T{\| _{\mathrm{HS}}}\big),\]
where
\[\begin{aligned}{}\widetilde{M}(F)& ={\sum \limits_{p=1}^{\infty }}\sqrt{{\sum \limits_{r=1}^{p-1}}{\widetilde{\Upsilon }_{p,p}^{2}}(r)\| {f_{p}}{\otimes _{r}}{f_{p}}{\| _{{\mathcal{H}^{\otimes (2p-2r)}}\otimes {L^{2}}{({\mathbb{S}^{d-1}})^{\otimes 2}}}^{2}}},\\ {} \widetilde{C}(F)& ={\sum \limits_{1\le p,q\le \infty ,p\ne q}^{\infty }}\sqrt{{\sum \limits_{r=1}^{p\wedge q}}{\widetilde{\Upsilon }_{p,q}^{2}}(r)\| {f_{p}}{\otimes _{r}}{f_{q}}{\| _{{\mathcal{H}^{\otimes (p+q-2r)}}\otimes {L^{2}}{({\mathbb{S}^{d-1}})^{\otimes 2}}}^{2}}},\end{aligned}\]
and
\[ {\widetilde{\Upsilon }_{p,q}}(r)={p^{2}}(r-1)!\left(\genfrac{}{}{0.0pt}{}{p-1}{r-1}\right)\left(\genfrac{}{}{0.0pt}{}{q-1}{r-1}\right)(p+q-2r)!.\]

A.2 The ReLu activation function

We consider here the most popular activation function, i.e., the standard ReLu defined by $\sigma (t)=t{\mathbb{I}_{[0,\infty )}}(t)$. The Hermite expansion is known to be given by (see for instance [13, Lemma 17], or [19, Theorem 2], and [14, 11]):
\[ {J_{q}}(\sigma )=\left\{\begin{array}{l@{\hskip10.0pt}l}\frac{1}{\sqrt{2\pi }},\hspace{1em}& q=0,\\ {} \frac{1}{2},\hspace{1em}& q=1,\\ {} 0,\hspace{1em}& q\gt 1,q\hspace{2.5pt}\text{odd}\\ {} \frac{{(-1)^{\frac{q}{2}+1}}(q-3)!!}{\sqrt{\pi }\sqrt{q!}},\hspace{1em}& q\hspace{2.5pt}\text{even}.\end{array}\right.\]
The following lemma is standard (compare [19]), but we include it for completeness.
Lemma 19.
As $q\to \infty $,
\[ {J_{q}^{2}}\sim \frac{\sqrt{2}}{\sqrt{{\pi ^{3}}}{q^{5/2}}}.\]
Proof.
The result follows from a straightforward application of Stirling’s formula, which gives
\[\begin{aligned}{}q!& \sim \sqrt{2\pi }{q^{q+\frac{1}{2}}}\exp (-q),\\ {} (q-3)!!& =\frac{(q-3)!}{{2^{\frac{q}{2}-2}}(\frac{q}{2}-2)!}\sim \frac{{(q-3)^{q-\frac{5}{2}}}\exp (-q+3)}{{2^{\frac{q}{2}-2}}{(\frac{q}{2}-2)^{\frac{q}{2}-\frac{3}{2}}}\exp (-\frac{q}{2}+2)}\\ {} & =\frac{\exp (-\frac{q}{2}+1)}{{(q-3)^{1/2}}{(\frac{q}{2}-2)^{1/2}}}{\bigg(1+\frac{1}{q-4}\bigg)^{\frac{q}{2}-2}}{(q-3)^{\frac{q}{2}}},\end{aligned}\]
so that
\[\begin{aligned}{}\frac{{((q-3)!!)^{2}}}{\pi (q)!}& \sim \frac{\frac{\exp (-q+2)}{(q-3)(\frac{q}{2}-2)}{(1+\frac{1}{q-4})^{q-4}}{(q-3)^{q}}}{\sqrt{2{\pi ^{3}}}{(q)^{q+\frac{1}{2}}}\exp (-q)}\\ {} & \sim \frac{\exp (3)}{\sqrt{2{\pi ^{3}}}(q-3)(\frac{q}{2}-2)\sqrt{q}}{\bigg(1-\frac{3}{q}\bigg)^{q}}\sim \frac{{2^{1/2}}}{\sqrt{{\pi ^{3}}}{(q)^{5/2}}}.\end{aligned}\]
 □
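The asymptotics of Lemma 19 can be checked numerically. The sketch below evaluates, on the logarithmic scale, the expression ${J_{q}^{2}}={((q-3)!!)^{2}}/(\pi \hspace{0.1667em}q!)$ for even $q$ as it appears in the proof, using $(q-3)!!=(q-2)!/({2^{\frac{q}{2}-1}}(\frac{q}{2}-1)!)$, and compares it with $\sqrt{2}/(\sqrt{{\pi ^{3}}}{q^{5/2}})$. The function names are illustrative, and the check is no substitute for the proof.

```python
from math import exp, lgamma, log, pi, sqrt

def log_J_sq(q):
    """log of ((q-3)!!)^2 / (pi * q!) for even q >= 2."""
    if q < 2 or q % 2:
        raise ValueError("q must be even and at least 2")
    # (q-3)!! = (q-2)! / (2^(q/2-1) * (q/2-1)!)
    log_dfact = lgamma(q - 1) - (q / 2 - 1) * log(2) - lgamma(q / 2)
    return 2 * log_dfact - log(pi) - lgamma(q + 1)

def stirling_ratio(q):
    """Ratio between J_q^2 and its claimed asymptotic equivalent."""
    asymptotic = sqrt(2) / (pi ** 1.5 * q ** 2.5)
    return exp(log_J_sq(q)) / asymptotic

if __name__ == "__main__":
    for q in (10, 100, 1000, 10000):
        print(q, stirling_ratio(q))
```

The ratio approaches 1 as $q$ grows, with a relative error of order $1/q$.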
Remark 20.
The corresponding covariance kernel is given by, for any ${x_{1}},{x_{2}}\in {\mathbb{S}^{d-1}}$,
\[\begin{aligned}{}& \hspace{1em}\hspace{2.5pt}\mathbb{E}\big[\sigma \big({W^{T}}{x_{1}}\big)\sigma \big({W^{T}}{x_{2}}\big)\big]\\ {} & =\frac{1}{2\pi }+\frac{\langle {x_{1}},{x_{2}}\rangle }{4}+\frac{{\langle {x_{1}},{x_{2}}\rangle ^{2}}}{4\pi }+\frac{1}{2\pi }{\sum \limits_{q=2}^{\infty }}\frac{{((2q-3)!!)^{2}}}{(2q)!}{\langle {x_{1}},{x_{2}}\rangle ^{2q}}\\ {} & =\frac{1}{2\pi }\big(u(\pi -\arccos u)+\sqrt{1-{u^{2}}}\big),\end{aligned}\]
for $u=\langle {x_{1}},{x_{2}}\rangle $, see also [5].
Remark 21.
The rate for ${J_{q}}$ in Lemma 19 is consistent with the one obtained by [19]. In [13], ${J_{q}^{2}}=O({q^{-3}})$ is given instead, yielding in [13, Theorem 3] the rate
\[ \bigg(\frac{\log d\times \log \log n}{\log n}\bigg).\]
According to Lemma 19, this rate becomes
\[ {\bigg(\frac{\log d\times \log \log n}{\log n}\bigg)^{3/4}},\]
which is the one we actually report in Table 1.

Acknowledgement

This paper originated from a very inspiring short course taught by Boris Hanin at the University of Rome Tor Vergata in January 2023; we are very grateful to him for many deep insights and illuminating conversations.

References

[1] 
Azmoodeh, E., Peccati, G., Yang, X.: Malliavin-Stein method: a survey of some recent developments. Mod. Stoch. Theory Appl. 8(2), 141–177 (2021). MR4279874. https://doi.org/10.15559/21-vmsta184
[2] 
Bach, F.: Breaking the curse of dimensionality with convex neural networks. J. Mach. Learn. Res. 18(19), 1–53 (2017). MR3634886
[3] 
Basteri, A., Trevisan, D.: Quantitative Gaussian approximation of randomly initialized deep neural networks (2022). arXiv:2203.07379
[4] 
Belkin, M., Hsu, D., Ma, S., Mandal, S.: Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. USA 116(32), 15849–15854 (2019). MR3997901. https://doi.org/10.1073/pnas.1903070116
[5] 
Bietti, A., Bach, F.: Deep equals shallow for ReLu networks in kernel regimes. In: International Conference on Learning Representations (ICLR), 9 (2021)
[6] 
Bordino, A., Favaro, F., Fortini, S.: Non-asymptotic approximations of Gaussian neural networks via second-order Poincaré inequalities (2023). arXiv:2304.04010
[7] 
Bourguin, S., Campese, S., Leonenko, N., Taqqu, M.S.: Four moments theorems on Markov chaos. Ann. Probab. 47(3), 1417–1446 (2019). MR3945750. https://doi.org/10.1214/18-AOP1287
[8] 
Bourguin, S., Campese, S.: Approximation of Hilbert-valued Gaussians on Dirichlet structures. Electron. J. Probab. 25, 150 (2020), 30 pp. MR4193891. https://doi.org/10.1214/20-ejp551
[9] 
Chizat, L., Oyallon, E., Bach, F.: On lazy training in differentiable programming. In: Advances in Neural Information Processing Systems 32 (NeurIPS 2019) (2019)
[10] 
Cybenko, G.: Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2(4), 303–314 (1989). MR1015670. https://doi.org/10.1007/BF02551274
[11] 
Daniely, A., Frostig, R., Singer, Y.: Toward deeper understanding of neural networks: the power of initialization and a dual view on expressivity. In: NeurIPS 2016, Volume 29, 2253–2261 (2016)
[12] 
Döbler, C., Kasprzak, M., Peccati, G.: The multivariate functional de Jong CLT. Probab. Theory Relat. Fields 184(1–2), 367–399 (2022). MR4498513. https://doi.org/10.1007/s00440-022-01114-3
[13] 
Eldan, R., Mikulincer, D., Schramm, T.: Non-asymptotic approximations of neural networks by Gaussian processes (2021). arXiv:2102.08668
[14] 
Goel, S., Karmalkar, S., Klivans, S.A.: Time/accuracy tradeoffs for learning a ReLu with respect to Gaussian marginals. In: NeurIPS 2019, pp. 8582–8591 (2019)
[15] 
Hanin, B.: Random neural networks in the infinite width limit as Gaussian processes (2021). arXiv:2107.01562
[16] 
Hornik, K.: Multilayer feedforward networks are universal approximators. Neural Netw. 2(5), 359–366 (1989). https://doi.org/10.1016/0893-6080(89)90020-8
[17] 
Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4(2), 251–257 (1991). https://doi.org/10.1016/0893-6080(91)90009-T
[18] 
Jacot, A., Gabriel, F., Hongler, C.: Neural tangent kernel: convergence and generalization in neural networks. In: Advances in Neural Information Processing Systems 31 (NeurIPS 2018) (2018)
[19] 
Klukowski, A.: Rate of convergence of polynomial networks to Gaussian processes (2021). arXiv:2111.03175
[20] 
Ledoux, M., Nourdin, I., Peccati, G.: Stein’s method, logarithmic Sobolev and transport inequalities. Geom. Funct. Anal. 25(1), 256–306 (2015). MR3320893. https://doi.org/10.1007/s00039-015-0312-0
[21] 
Leshno, M., Lin, V.Ya., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 6(6), 861–867 (1993). https://doi.org/10.1016/S0893-6080(05)80131-5
[22] 
Marinucci, D., Peccati, G.: Random Fields on the Sphere. Cambridge University Press (2011). MR2840154. https://doi.org/10.1017/CBO9780511751677
[23] 
Neal, R.M.: Priors for infinite networks. In: Bayesian Learning for Neural Networks, pp. 29–53. Springer, New York, NY (1996). https://doi.org/10.1007/978-1-4612-0745-0_2
[24] 
Nourdin, I., Peccati, G.: Stein’s method on Wiener chaos. Probab. Theory Relat. Fields 145(1–2), 75–118 (2009). MR2520122. https://doi.org/10.1007/s00440-008-0162-x
[25] 
Nourdin, I., Peccati, G.: Normal Approximations with Malliavin Calculus. From Stein’s Method to Universality. Cambridge Tracts in Math., vol. 192. Cambridge University Press, Cambridge (2012). MR2962301. https://doi.org/10.1017/CBO9781139084659
[26] 
Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numerica (1999). MR1819645. https://doi.org/10.1017/S0962492900002919
[27] 
Rahimi, A., Recht, B.: Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems 20 (NeurIPS 2007) (2007)
[28] 
Roberts, D.A., Yaida, S., Hanin, B.: The principles of deep learning theory (2021). arXiv:2106.10165
[29] 
Yaida, S.: Non-Gaussian processes and neural networks at finite widths (2019). arXiv:1910.00019. MR4198759. https://doi.org/10.1007/s40687-020-00233-4
[30] 
Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: 5th International Conference on Learning Representations (ICLR) (2017)
Copyright
© 2024 The Author(s). Published by VTeX
Open access article under the CC BY license.

Keywords
Quantitative functional central limit theorem, Wiener-chaos expansions, neural networks, Gaussian processes

MSC2010
60F17 68T07 60G60

Funding
The work was partially supported by the MUR Excellence Department Project MatMod@TOV awarded to the Department of Mathematics, University of Rome Tor Vergata, CUP E83C18000100006. We also acknowledge financial support from the MUR 2022 PRIN project GRAFIA, project code 202284Z9E4, the INdAM group GNAMPA and the PNRR CN1 High Performance Computing, Spoke 3.

Table 1.
Comparison of convergence rates established by different functional quantitative central limit theorems for several activation functions. Bear in mind that two different metrics ${d_{2}}\le {\mathcal{W}_{2}}$ are considered, ${\mathcal{W}_{2}}$ for [13, 19], and ${d_{2}}$ for this paper. The parameters α and β must satisfy $\alpha \gt 1/2$ and $\beta \gt \log \sqrt{3}$
| Activation / coefficients | Eldan et al. [13] | Klukowski [19] | This paper |
| --- | --- | --- | --- |
| ${J_{q}}\sim {q^{-\alpha }}$ | ${(\frac{\log n}{\log \log n\log d})^{-\alpha +\frac{1}{2}}}$ | – | ${(\log n)^{-\alpha +\frac{1}{2}}}$ |
| ReLu | ${(\frac{\log n}{\log \log n\log d})^{-\frac{3}{4}}}$ | ${n^{-\frac{3}{4d-2}}}$ | ${(\log n)^{-\frac{3}{4}}}$ |
| tanh / logistic | $\exp (-c\sqrt{\frac{\log n}{\log d\log \log n}})$ | – | $\exp (-c\sqrt{\log n})$ |
| ${J_{q}}\sim {e^{-\beta q}}$ | ${n^{-c{(\log \log n\log d)^{-1}}}}$ | – | ${n^{-\frac{1}{2}}}$ |
| erf | ${n^{-c{(\log \log n\log d)^{-1}}}}$ | ${C^{d}}{(\log n)^{\frac{d}{2}-1}}{n^{-\frac{1}{2}}}$ | ${n^{-\frac{1}{2}}}$ |
| polynomial order p | ${p^{cp}}{d^{\frac{5p}{6}-\frac{1}{12}}}{n^{-\frac{1}{6}}}$ | ${(d+p)^{\frac{d}{2}}}{n^{-\frac{1}{2}}}$ | ${n^{-\frac{1}{2}}}$ |
Theorem 1.
Under the previous assumptions and notations, and letting Z be the Gaussian process with zero mean and same covariance as F, we have that, for all $Q\le {\log _{3}}\sqrt{n}$,
(3)
\[ {d_{2}}(F,Z)\le C\| \sigma \| \frac{1}{\sqrt[4]{n}}\sqrt{{\sum \limits_{q=0}^{Q}}{J_{q}^{2}}(\sigma )q{3^{q}}}+\frac{3}{2}\sqrt{{\sum \limits_{q=Q+1}^{\infty }}{J_{q}^{2}}(\sigma )},\]
where C is an absolute constant (in particular, independent of the input dimension d), and $\| \sigma \| $ is the ${L^{2}}$ norm of σ taken with respect to the Gaussian density on $\mathbb{R}$.
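The trade-off between the two terms in (3) can be explored numerically. The following Python sketch is only an illustration under stated assumptions: we set the unspecified constant to $C=1$, truncate the infinite tail at a finite `q_max`, take $\| \sigma {\| ^{2}}={\textstyle\sum _{q}}{J_{q}^{2}}(\sigma )$, and use hypothetical coefficients ${J_{q}^{2}}={q^{-5/2}}$ (with ${J_{0}^{2}}=1$) that only mimic a polynomial decay. All function names are ours.

```python
import math

def relu_like_j2(q):
    """Hypothetical squared coefficients with q^(-5/2) decay (illustrative only)."""
    return 1.0 if q == 0 else q ** -2.5

def theorem1_bound(j2, n, Q, C=1.0, q_max=10_000):
    """Right-hand side of (3), with C = 1 assumed and the tail truncated at q_max."""
    sigma_norm = math.sqrt(sum(j2(q) for q in range(q_max + 1)))
    head = sum(j2(q) * q * 3**q for q in range(Q + 1))
    tail = sum(j2(q) for q in range(Q + 1, q_max + 1))
    return C * sigma_norm * n ** -0.25 * math.sqrt(head) + 1.5 * math.sqrt(tail)

def best_truncation(j2, n, **kwargs):
    """Minimise the bound over integers 0 <= Q <= log_3(sqrt(n)); returns (bound, Q)."""
    q_cap = int(math.log(math.sqrt(n), 3))  # flooring keeps Q within the constraint
    return min((theorem1_bound(j2, n, Q, **kwargs), Q) for Q in range(q_cap + 1))

for n in (10**6, 10**12):
    bound, Q = best_truncation(relu_like_j2, n)
    print(f"n = {n:.0e}: best Q = {Q}, bound = {bound:.4f}")
```

Increasing Q shrinks the tail term but inflates the first term through the factor ${3^{q}}$, so the minimising Q grows only logarithmically with n.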
Theorem 4.
Under the previous assumptions and notations, and assuming furthermore that $\sigma (Wx)\in {\mathbb{D}^{1,4}}$, we have that, for all $Q\in \mathbb{N}$,
(4)
\[ {d_{2}}(F,Z)\le C\frac{1}{\sqrt{n}}{\sum \limits_{q=0}^{Q}}{J_{q}^{2}}(\sigma )q{3^{q}}\Bigg(\| \sigma {\| ^{2}}+\frac{1}{\sqrt{n}}{\sum \limits_{q=0}^{Q}}{J_{q}^{2}}(\sigma ){3^{q}}\Bigg)+\frac{3}{2}\sqrt{{\sum \limits_{q=Q+1}^{\infty }}{J_{q}^{2}}(\sigma )},\]
where C is an absolute constant (in particular, independent of the input dimension d), and $\| \sigma \| $ is the ${L^{2}}$ norm of σ taken with respect to the Gaussian density on $\mathbb{R}$.
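To see how (4) can produce the ${n^{-1/2}}$ rate reported in Table 1 for exponentially decaying coefficients, the sketch below evaluates the right-hand side for hypothetical coefficients ${J_{q}^{2}}={e^{-2\beta q}}$ with $\beta =1\gt \log \sqrt{3}$. We assume $C=1$, take $\| \sigma {\| ^{2}}={\textstyle\sum _{q}}{J_{q}^{2}}(\sigma )$, and truncate the tail at `q_max`; these choices and the function names are ours, and the regularity assumption of the theorem is taken for granted.

```python
import math

BETA = 1.0  # any value above log(sqrt(3)) ~ 0.549 makes the weighted sums converge

def exp_decay_j2(q):
    """Hypothetical squared coefficients J_q^2 = exp(-2 * BETA * q) (illustrative only)."""
    return math.exp(-2.0 * BETA * q)

def theorem4_bound(j2, n, Q, C=1.0, q_max=2000):
    """Right-hand side of (4), with C = 1 assumed and the tail truncated at q_max."""
    sigma_norm_sq = sum(j2(q) for q in range(q_max + 1))
    weighted = sum(j2(q) * q * 3.0**q for q in range(Q + 1))
    plain = sum(j2(q) * 3.0**q for q in range(Q + 1))
    tail = sum(j2(q) for q in range(Q + 1, q_max + 1))
    root_n = math.sqrt(n)
    return C / root_n * weighted * (sigma_norm_sq + plain / root_n) + 1.5 * math.sqrt(tail)

for n in (10**4, 10**6, 10**8):
    # With Q large the tail is negligible, so sqrt(n) * bound stays close to a constant.
    print(n, math.sqrt(n) * theorem4_bound(exp_decay_j2, n, Q=200))
```

Because ${\textstyle\sum _{q}}{J_{q}^{2}}q{3^{q}}$ is finite when $\beta \gt \log \sqrt{3}$, Q can be taken arbitrarily large and only the $1/\sqrt{n}$ factor remains.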
Theorem 16 (A special case of Theorem 3.10 in [8]).
Let ${F_{\le Q}}\in {L^{2}}(\Omega ,K)$ be a Hilbert-valued random element ${F_{\le Q}}:\Omega \to K$ with zero mean, covariance operator ${S_{\le Q}}$ and such that it can be decomposed into a finite number of Wiener chaoses:
\[ {F_{\le Q}}(.)={\sum \limits_{p=0}^{Q}}{F_{p}}(.).\]
Then, for Z a Gaussian process on the same structure with covariance operator S, we have that
\[ {d_{2}}({F_{\le Q}},Z)\le \frac{1}{2}\sqrt{M({F_{\le Q}})+C({F_{\le Q}})}+\| S-{S_{\le Q}}{\| _{{L^{2}}(\Omega ,\mathrm{HS})}},\]
where
\[\begin{aligned}{}M({F_{\le Q}})& =\frac{1}{\sqrt{3}}\sum \limits_{p,q}{c_{p,q}}\sqrt{\mathbb{E}\| {F_{p}}{\| ^{4}}\big(\mathbb{E}\| {F_{q}}{\| ^{4}}-\mathbb{E}\| {Z_{q}}{\| ^{4}}\big)},\\ {} C({F_{\le Q}})& =\sum \limits_{\substack{p,q\\ {} p\ne q}}{c_{p,q}}\operatorname{Cov}\big(\| {F_{p}}{\| ^{2}},\| {F_{q}}{\| ^{2}}\big),\end{aligned}\]
${Z_{q}}$ is a centred Gaussian process with the same covariance operator as ${F_{q}}$, i.e., $\mathbb{E}[{Z_{q}}({x_{1}}){Z_{q}}({x_{2}})]={J_{q}^{2}}(\sigma ){\langle {x_{1}},{x_{2}}\rangle ^{q}}$, and
\[ {c_{p,q}}=\left\{\begin{array}{l@{\hskip10.0pt}l}1+\sqrt{3},\hspace{1em}& p=q,\\ {} \frac{p+q}{2p},\hspace{1em}& p\ne q.\end{array}\right.\]
Theorem 18 (A special case of Theorem 4.3 in [8]).
Let Z be a centred random element of ${L^{2}}({\mathbb{S}^{d-1}})$ with covariance operator S and $F\in {\mathbb{D}^{1,4}}$ with covariance operator T and chaos decomposition $F={\textstyle\sum _{p}}{I_{p}}({f_{p}})$, where ${f_{p}}\in {\mathcal{H}^{\odot p}}\otimes {L^{2}}({\mathbb{S}^{d-1}})$. Then
\[ {d_{2}}(F,Z)\le \frac{1}{2}\big(\widetilde{M}(F)+\widetilde{C}(F)+\| S-T{\| _{\mathrm{HS}}}\big),\]
where
\[\begin{aligned}{}\widetilde{M}(F)& ={\sum \limits_{p=1}^{\infty }}\sqrt{{\sum \limits_{r=1}^{p-1}}{\widetilde{\Upsilon }_{p,p}^{2}}(r)\| {f_{p}}{\otimes _{r}}{f_{p}}{\| _{{\mathcal{H}^{\otimes (2p-2r)}}\otimes {L^{2}}{({\mathbb{S}^{d-1}})^{\otimes 2}}}^{2}}},\\ {} \widetilde{C}(F)& ={\sum \limits_{1\le p,q\le \infty ,p\ne q}^{\infty }}\sqrt{{\sum \limits_{r=1}^{p\wedge q}}{\widetilde{\Upsilon }_{p,q}^{2}}(r)\| {f_{p}}{\otimes _{r}}{f_{q}}{\| _{{\mathcal{H}^{\otimes (p+q-2r)}}\otimes {L^{2}}{({\mathbb{S}^{d-1}})^{\otimes 2}}}^{2}}},\end{aligned}\]
and
\[ {\widetilde{\Upsilon }_{p,q}}(r)={p^{2}}(r-1)!\left(\genfrac{}{}{0.0pt}{}{p-1}{r-1}\right)\left(\genfrac{}{}{0.0pt}{}{q-1}{r-1}\right)(p+q-2r)!.\]
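For concreteness, the combinatorial weight ${\widetilde{\Upsilon }_{p,q}}(r)$ can be evaluated directly from its definition. The short Python sketch below is only an illustration (the function name is ours); it uses exact integer arithmetic from the standard library.

```python
from math import comb, factorial

def upsilon_tilde(p, q, r):
    """p^2 (r-1)! C(p-1, r-1) C(q-1, r-1) (p+q-2r)! for 1 <= r <= min(p, q)."""
    if not 1 <= r <= min(p, q):
        raise ValueError("r must satisfy 1 <= r <= min(p, q)")
    return (p**2 * factorial(r - 1) * comb(p - 1, r - 1)
            * comb(q - 1, r - 1) * factorial(p + q - 2 * r))

print(upsilon_tilde(2, 2, 1))                           # 4 * 0! * 1 * 1 * 2! = 8
print(upsilon_tilde(2, 3, 1), upsilon_tilde(3, 2, 1))   # 24 and 54
```

The last line shows that the weight is not symmetric in $(p,q)$, because of the prefactor ${p^{2}}$.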
