[Reading Notes][Ongoing] Aad van der Vaart "Asymptotic Statistics"
-
Too busy today; I'm reading a paper my advisor assigned. I'll update on the weekend
-
Here to support quality writing ((
-
Ooh nice, I'll probably need to learn this later too
Following and supporting
-
About the next few updates: the content is a paper my advisor asked me to read. Although it is not from van der Vaart's book, it is also about semiparametric models (some of its content appears to come from Aad van der Vaart's early lecture notes), so I'll post it here as well.
Semiparametric Doubly Robust Targeted Double Machine Learning: A Review by Edward H. Kennedy (Associate Professor, CMU Dept. of Statistics & Data Science)
1. Introduction
"In this review we cover the basics of efficient nonparametric parameter estimation, with a focus on parameters that arise in causal inference problems. We review both efficiency bounds (i.e., what is the best possible performance for estimating a given parameter?) and the analysis of particular estimators (i.e., what is this estimator’s error, and does it attain the efficiency bound?) under weak assumptions."
Comment: Why a nonparametric model? For the ideas behind this, see a short essay by Mark van der Laan (distinguished professor in the UC Berkeley Statistics Department, COPSS Award winner, with major contributions to semiparametric theory, survival analysis, etc.): "Why We Need a Statistical Revolution" https://senseaboutscienceusa.org/super-learning-and-the-revolution-in-knowledge/
1.1 Set-up and notations
Suppose we observe a sample of independent observations $(Z_1, \cdots, Z_n)$ all identically distributed according to some unknown probability distribution $\mathbb{P}$, which is assumed to lie in some model (i.e., set of distributions) $\mathcal{P}$. Our goal is to estimate some structured combination of components, called a target parameter $\psi: \mathcal{P}\rightarrow \mathbb{R}^q$ (which is a functional).
At times we subscript expectations and other quantities with the distribution under which they are taken, i.e., $\mathbb{E}_{P}(Y \mid X=x)$ for an expectation under distribution $P$. When the distribution is clear from context, we sometimes omit subscripts; in general, quantities without subscripts are meant to be taken under some generic $P$ in the model, or else under the true distribution $\mathbb{P}$. We denote convergence in distribution by $\rightsquigarrow$ and convergence in probability by $\stackrel{p}{\rightarrow}$. We use standard big-oh and little-oh notation, i.e., $X_{n}=O_{\mathbb{P}}(r_{n})$ means $X_{n}/r_{n}$ is bounded in probability and $X_{n}=o_{\mathbb{P}}(r_{n})$ means $X_{n}/r_{n} \stackrel{p}{\rightarrow} 0$. To ease notation we sometimes omit arguments for functions of multiple arguments, e.g., $\varphi=\varphi(z; P)$ when the arguments are clear or secondary to the discussion. We use $\mathbb{P}_{n}$ to denote the empirical measure so that sample averages are written as $\mathbb{P}_{n}(f)=\mathbb{P}_{n}\{f(Z)\}=\frac{1}{n}\sum_{i} f(Z_{i})$. For a possibly random function $\widehat{f}$, we similarly write $\mathbb{P}(\widehat{f})=\mathbb{P}\{\widehat{f}(Z)\}=\int \widehat{f}(z)\, d\mathbb{P}(z)$, and we let $\|\widehat{f}\|^{2}=\int \widehat{f}(z)^{2}\, d\mathbb{P}(z)$ denote the squared $L_{2}(\mathbb{P})$ norm.
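To make the notation concrete, here is a minimal numeric illustration (a hypothetical example, not from the review): with $Z \sim N(0,1)$ and $f(z)=z^2$, the empirical measure $\mathbb{P}_n(f)$ is just a sample average, and it should be close to $\mathbb{P}(f)=\mathbb{E}[Z^2]=1$.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)   # a sample Z_1, ..., Z_n from P = N(0, 1)

f = lambda t: t ** 2
Pn_f = np.mean(f(z))          # P_n(f) = (1/n) * sum_i f(Z_i)
print(Pn_f)                   # should be close to P(f) = E[Z^2] = 1
```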
2. Benchmarks: Nonparametric Efficiency Bounds
“After having selected an appropriate target parameter $\psi$ matching the scientific question of interest, identifying (or bounding) it under appropriate causal (in causal inference, identification means expressing the target parameter as a function of the observed-data distribution under certain assumptions) or other assumptions, and laying out a statistical model $\mathcal{P}$ (which in our case will be nonparametric), a next line of business is to understand lower bounds or benchmarks for estimation error. In other words, how well can we possibly hope to estimate the parameter $\psi$ over the model $\mathcal{P}$?
There are two parts to showing optimality: (i) that no estimator can do better than some benchmark, and (ii) that a particular estimator does in fact attain that benchmark. Part (i) is discussed in this section, and part (ii) in the next section.”
A classic benchmarking or lower bound result for smooth parametric models is the Cramér–Rao bound. In its simplest form, this result states that for smooth parametric models $\mathcal{P}=\{P_{\theta}: \theta \in \mathbb{R}\}$ and smooth functionals (i.e., with $P_{\theta}$ and $\psi(\theta)$ differentiable in $\theta$), the variance of any unbiased estimator $\widehat{\psi}$ must satisfy
$$\operatorname{var}_{\theta}(\widehat{\psi}) \geq \frac{\psi^{\prime}(\theta)^{2}}{\operatorname{var}_{\theta}\{s_{\theta}(Z)\}}$$
where $s_{\theta}(z)=\frac{\partial}{\partial \theta} \log p_{\theta}(z)$ is the score function, i.e., no unbiased estimator can have smaller variance than the above ratio.
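As a sanity check of the bound on a hypothetical example (not from the review): for $Z \sim \mathrm{Bernoulli}(\theta)$ with $\psi(\theta)=\theta$, the score is $s_\theta(z)=(z-\theta)/\{\theta(1-\theta)\}$, so for $n$ i.i.d. observations the bound is $\theta(1-\theta)/n$, which the (unbiased) sample mean attains.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 200, 5_000

# Simulate the sample mean (an unbiased estimator of theta) many times
est = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)
mc_var = est.var()                          # Monte Carlo variance of the estimator
cr_bound = theta * (1 - theta) / n          # Cramér–Rao bound for n i.i.d. draws
print(mc_var, cr_bound)                     # the two should nearly coincide
```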
A standard way to benchmark estimation error more generally is through minimax lower bounds of the form
$$\inf_{\widehat{\psi}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P}\left[\{\widehat{\psi}-\psi(P)\}^{2}\right] \geq R_{n}.$$
Comment: Here let me first introduce minimax theory, mainly following the lecture notes of Larry Wasserman (chaired professor in the CMU Statistics Department, COPSS Award winner, with major contributions to nonparametric statistics, Bayesian inference, etc.): https://www.stat.cmu.edu/~larry/=sml/Minimax.pdf
“When solving a statistical learning problem, there are often many procedures to choose from. This leads to the following question: how can we tell if one statistical learning procedure is better than another? One answer is provided by minimax theory which is a set of techniques for finding the minimum, worst case behavior of a procedure.”
Let $\mathcal{P}$ be a set of distributions and let $X_{1}, \ldots, X_{n}$ be a sample from some distribution $P \in \mathcal{P}$. Let $\theta(P)$ be some function of $P$. For example, $\theta(P)$ could be the mean of $P$, the variance of $P$ or the density of $P$. Let $\widehat{\theta}=\widehat{\theta}(X_{1}, \ldots, X_{n})$ denote an estimator. Given a metric $d$, the minimax risk is
$$R_{n} \equiv R_{n}(\mathcal{P})=\inf_{\widehat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P}[d(\widehat{\theta}, \theta(P))]$$
where the infimum is over all estimators.
Comment: How should we understand the minimax risk? Here I draw on a Zhihu answer: https://www.zhihu.com/question/347730562/answer/835333769
The "max" refers to $\sup_{P \in \mathcal{P}} \mathbb{E}_{P}[d(\widehat{\theta}, \theta(P))]$, the supremum over all possible "true" values $\theta(P)$: having fixed an estimator $\widehat{\theta}$, we let $\theta(P)$ roam freely over the statistical model ($\mathcal{P}$, which contains the true distribution, i.e., $\mathbb{P}\in \mathcal{P}$) to maximize the risk, which tells us the largest risk this chosen estimator $\widehat{\theta}$ can possibly incur.
The "min" is then the choice of estimator $\widehat{\theta}$ that minimizes these worst-case risks. So saying an estimator is minimax means that its worst-case performance $\sup_{P \in \mathcal{P}} \mathbb{E}_{P}[d(\widehat{\theta}, \theta(P))]$ (no matter how $\theta(P)$, $P \in \mathcal{P}$, is chosen) is the best possible.
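A classical textbook illustration of this min-over-max logic (a hypothetical example, not from the notes): for estimating a Bernoulli proportion $\theta$ from $n$ observations under squared error, the estimator $(S + \sqrt{n}/2)/(n + \sqrt{n})$, with $S$ the number of successes, has constant risk $n/\{4(n+\sqrt{n})^2\}$ and is minimax, while the sample mean's worst-case risk $1/(4n)$ (attained at $\theta = 1/2$) is strictly larger.

```python
import numpy as np

n = 100
theta = np.linspace(0.0, 1.0, 201)            # grid of possible "truths"

# Exact MSE of the sample mean for Bernoulli(theta): theta (1 - theta) / n
risk_mean = theta * (1 - theta) / n

# The estimator (S + sqrt(n)/2) / (n + sqrt(n)) has constant risk in theta
risk_mm = np.full_like(theta, n / (4 * (n + np.sqrt(n)) ** 2))

# Compare worst-case (sup over theta) risks: the constant-risk estimator wins
print(risk_mean.max(), risk_mm.max())
```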
-
More updates to come gradually (padding this comment to the minimum length)
-
I feel that keeping certain terms in the original language works well, mixing Chinese and English (whether it's actually English doesn't really matter). Even to a complete layman like me, "Borel" feels more natural than a phonetic transliteration (
-
@wumingshi OK, thanks for the suggestion
-
"Can the above Cramér–Rao bound (which works for smooth parametric models) be exploited to construct lower bound benchmarks for larger semi- or nonparametric models as well?"
2.1 Parametric Submodels
"The standard way to connect classic Cramer-Rao bounds for parametric models to larger more complicated nonparametric models is through a technical device called the parametric submodel"
Definition 1. A parametric submodel is a smooth parametric model $\mathcal{P}_{\epsilon}=\{P_{\epsilon}: \epsilon \in \mathbb{R}\}$ that satisfies (i) $\mathcal{P}_{\epsilon} \subseteq \mathcal{P}$, and (ii) $P_{\epsilon=0}=\mathbb{P}$.
Thus, in words, a parametric submodel is a parametric model that (i) is contained in the larger model $\mathcal{P}$ of interest, and (ii) equals the true distribution at $\epsilon=0$, i.e., contains the truth $\mathbb{P}$.
The high-level idea behind using submodels is that it is never harder to estimate a parameter over a smaller model, relative to a larger one in which the smaller model is contained. So any lower bound for a submodel will also be a valid lower bound for the larger model $\mathcal{P}$.
Comment: How should we understand "any lower bound for a submodel will also be a valid lower bound for the larger model $\mathcal{P}$"? Here is the explanation from van der Vaart's "Asymptotic Statistics", Chap. 25 Semiparametric Models.
"To estimate the parameter $\psi(P)$ given the model $\mathcal{P}$ is certainly harder than to estimate this parameter given that $P$ belongs to a submodel $\mathcal{P}_{0} \subset \mathcal{P}$. For every smooth parametric submodel $\mathcal{P}_{0}=\{P_{\theta}: \theta \in \Theta\} \subset \mathcal{P}$, we can calculate the Fisher information for estimating $\psi(P_{\theta})$. Then the information for estimating $\psi(P)$ in the whole model is certainly not bigger than the infimum of the informations over all submodels (recall that the Cramér–Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any unbiased estimator of $\theta$). We shall simply define the information for the whole model as this infimum. A submodel for which the infimum is taken (if there is one) is called least favorable or a "hardest" submodel."
Comment: (the following is adapted from Mark van der Laan's course materials for STAT C245B Survival Analysis and Causality) The benchmark/lower bound in a minimax theory sense for the target parameter $\psi$ is tightly connected to looking at its derivative (the functional derivative). We are interested in the behavior of $\psi$ for local perturbations around $\mathbb{P}$. In particular, the derivative of $\psi$ and the steepness of this derivative define the difficulty of the estimation problem. We therefore need a theory of functional derivatives.
-
The material on directional derivatives & pathwise derivatives for functionals will come in the next update.
-
The theory behind semiparametrics really is a bit involved; it touches on some functional analysis. I don't plan to get too theoretical, and will focus more on intuition.
-
I've decided not to post updates on directional derivatives & pathwise derivatives for functionals, or on more semiparametric theory; there is just too much, enough to fill a book. This article was meant as a practically oriented guide anyway, and I don't want to stray too far from that...
-
It turns out that, for the purposes of constructing lower bound benchmarks for functional estimation, it often suffices to use one-dimensional parametric submodels. A common choice of submodel for nonparametric $\mathcal{P}$ is, for some mean-zero function $h: \mathcal{Z} \rightarrow \mathbb{R}$,
$$p_{\epsilon}(z)=d\mathbb{P}(z)\{1+\epsilon h(z)\}$$
where $\|h\|_{\infty} \leq M<\infty$ and $\epsilon<1/M$ so that $p_{\epsilon}(z) \geq 0$. Note for this submodel the score function is $\left.\frac{\partial}{\partial \epsilon} \log p_{\epsilon}(z)\right|_{\epsilon=0}=\left.\frac{\partial}{\partial \epsilon} \log\{1+\epsilon h(z)\}\right|_{\epsilon=0}=h(z)$. Therefore the Cramér–Rao lower bound for some $P_{\epsilon}$ in the example one-dimensional submodel $\mathcal{P}_{\epsilon}$ above is given by
$$\frac{\psi^{\prime}(P_{\epsilon})^{2}}{\operatorname{var}_{P_{\epsilon}}\{s_{\epsilon}(Z)\}}=\frac{\left\{\left.\frac{\partial}{\partial \epsilon} \psi(P_{\epsilon})\right|_{\epsilon=0}\right\}^{2}}{\mathbb{E}_{P_{\epsilon}}\{h(Z)^{2}\}}.$$
Comment: Why a one-dimensional submodel? For a detailed explanation see Michael Kosorok, "Introduction to Empirical Processes and Semiparametric Inference", Chap. 18.
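A minimal sketch of this bound on a hypothetical discrete example: for $\psi(P) = \mathbb{E}_P[Z]$ we have $\psi(P_\epsilon) = \mathbb{E}_P[Z] + \epsilon\, \mathbb{E}_P[Z h(Z)]$, so the bound above becomes $\mathbb{E}_P[Z h(Z)]^2 / \mathbb{E}_P[h(Z)^2]$ for a mean-zero direction $h$. Taking $h(z) = z - \mathbb{E}_P[Z]$ attains $\operatorname{var}_P(Z)$, and by Cauchy–Schwarz no other mean-zero direction does better.

```python
import numpy as np

# Hypothetical discrete distribution P on {0, 1, 2}
z = np.array([0.0, 1.0, 2.0])
p = np.array([0.3, 0.4, 0.3])
mu = p @ z

def cr_bound(h):
    h = h - p @ h                     # center so that E_P[h(Z)] = 0
    # (d/d eps psi(P_eps)|_0)^2 / E_P[h(Z)^2] = E_P[Z h(Z)]^2 / E_P[h(Z)^2]
    return (p @ (z * h)) ** 2 / (p @ h ** 2)

var_z = p @ (z - mu) ** 2
print(cr_bound(z - mu), var_z)        # h(z) = z - mu attains var_P(Z)
print(cr_bound(z ** 2) <= var_z)      # any other direction does no better
```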
One more point to explain is why we chose $p_{\epsilon}(z)=d\mathbb{P}(z)\{1+\epsilon h(z)\}$ as the submodel (the following is adapted from Mark van der Laan's course materials for STAT C245B Survival Analysis and Causality).
We want to define a type of differentiability of $\psi: \mathcal{P} \rightarrow \mathbb{R}^{q}$, where $\psi$ is the target parameter.
We could use the definition of a directional derivative in direction $h$:
$$d\psi(\mathbb{P})(h)=\left.\frac{d}{d\epsilon}\psi(\mathbb{P}+\epsilon h)\right|_{\epsilon=0}$$
However, $\mathbb{P}+\epsilon h$ might not be a path through $\mathcal{P}$, and thus may be ill defined. We need to define a derivative along paths that are submodels of $\mathcal{P}$.
Let $\mathcal{P}$ be nonparametric. We define a class of paths such that:
$$p_{\epsilon}(z)=d\mathbb{P}(z)\{1+\epsilon h(z)\}$$
Two key assumptions necessary for it to be a proper submodel are as follows:
- $h$ is uniformly bounded
- $\mathbb{E}_{P} h(z)=0$
For $\epsilon \in(-\delta, \delta)$ with $\delta=\frac{1}{\|h\|_{\infty}}$, this is a submodel.
To see why, first note that for the paths to be a proper density, we need:
- $d\mathbb{P}(z)\{1+\epsilon h(z)\} \geqslant 0$
Sketch proof:
Suppose $h$ is uniformly bounded, and consider the worst case $|h(z)|=\|h\|_{\infty}$. If $|\epsilon| \leqslant \delta$, then $1+\epsilon h(z) \geqslant 1-|\epsilon|\|h\|_{\infty} \geqslant 0$. Therefore, for $\epsilon$ sufficiently small and $h$ uniformly bounded, $d\mathbb{P}(z)\{1+\epsilon h(z)\} \geqslant 0$.
- $\int\{1+\epsilon h(z)\}\, d\mathbb{P}(z)=1$
Sketch proof:
Note that $\int\{1+\epsilon h(z)\}\, d\mathbb{P}(z)=\int d\mathbb{P}(z)+\epsilon \int h(z)\, d\mathbb{P}(z)=1$ since $p$ is a proper density and $\int h(z)\, d\mathbb{P}(z)=\mathbb{E}_{P} h(z)=0$ by assumption.
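The two sketch proofs can be checked numerically on a hypothetical discrete sample space $\{0,1,2\}$: along the whole range $|\epsilon| \leq 1/\|h\|_{\infty}$, the perturbed $p_\epsilon$ stays nonnegative and still sums to one.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])          # a density for P on {0, 1, 2}
h = np.array([1.0, -1.0, 1.0])
h = h - p @ h                          # center so that E_P[h(Z)] = 0

delta = 1 / np.max(np.abs(h))          # delta = 1 / ||h||_inf
for eps in np.linspace(-delta, delta, 41):
    p_eps = p * (1 + eps * h)          # the path p_eps(z) = p(z){1 + eps h(z)}
    assert np.all(p_eps >= 0)          # nonnegative on the whole range
    assert np.isclose(p_eps.sum(), 1)  # still sums (integrates) to one
print("p_eps is a valid density for |eps| <= 1/||h||_inf")
```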
Now consider the score of this submodel.
$$\begin{aligned} \left.\frac{\partial}{\partial \epsilon} \log \frac{dP_{\epsilon}}{d\mathbb{P}}\right|_{\epsilon=0} & =\left.\frac{\partial}{\partial \epsilon} \log\{1+\epsilon h(z)\}\right|_{\epsilon=0} \\ & =\left.\frac{h(z)}{1+\epsilon h(z)}\right|_{\epsilon=0} \\ & =h(z). \end{aligned}$$
"Since any lower bound for the submodel $\mathcal{P}_{\epsilon}$ is also a lower bound for $\mathcal{P}$, the best and most informative is the greatest such lower bound. Can we say anything about the best such lower bound for generic functionals and/or submodels?"
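The score computation above can be verified numerically (same hypothetical discrete setup): a symmetric finite difference of $\log p_\epsilon$ at $\epsilon = 0$ recovers $h(z)$.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])          # hypothetical density for P
h = np.array([0.5, -0.4, 0.2])
h = h - p @ h                          # center so that E_P[h(Z)] = 0

# Score at eps = 0 via a symmetric finite difference in eps
d = 1e-6
score = (np.log(p * (1 + d * h)) - np.log(p * (1 - d * h))) / (2 * d)
print(np.max(np.abs(score - h)))       # tiny: the score equals h(z)
```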
2.2 Pathwise Differentiability
Recall the Cramér–Rao bound
$$\frac{\left\{\left.\frac{\partial}{\partial \epsilon} \psi(P_{\epsilon})\right|_{\epsilon=0}\right\}^{2}}{\mathbb{E}_{P_{\epsilon}}\{s_{\epsilon}(Z)^{2}\}}$$
for the submodel $\mathcal{P}_{\epsilon}$ described in the previous subsection. To find the best such lower bound, we would like to optimize the above over all $P_{\epsilon}$ in some submodels. It is not a priori clear how generally this can be accomplished, since different functionals $\psi$ could yield very different numerators. Therefore let us first consider what we can say about the derivative in the numerator, for a large class of pathwise differentiable functionals.
Namely, suppose the functional $\psi: \mathcal{P} \mapsto \mathbb{R}$ is smooth, as a map from distributions to the reals, in the sense that it admits a kind of distributional Taylor expansion
$$\psi(\bar{P})-\psi(P)=\int \varphi(z; \bar{P})\, d(\bar{P}-P)(z)+R_{2}(\bar{P}, P)$$
for distributions $\bar{P}$ and $P$, often called a von Mises expansion, where $\varphi(z; P)$ is a mean-zero, finite-variance function satisfying $\int \varphi(z; P)\, dP(z)=0$ and $\int \varphi(z; P)^{2}\, dP(z)<\infty$, and $R_{2}(\bar{P}, P)$ is a second-order remainder term (which means it only depends on products or squares of differences between $\bar{P}$ and $P$).
Intuitively, the von Mises expansion above is just an infinite-dimensional or distributional analog of a Taylor expansion, with $\varphi(z; Q)$ acting as a usual derivative term; it describes how the functional $\psi$ changes locally when the distribution changes from $P$ to $\bar{P}$. For example, when $Z \in\{1, \ldots, k\}$ is discrete and so $\bar{P}$ and $P$ have $k$ countable components, the von Mises expansion reduces to a standard multivariate Taylor expansion with
$$R_{2}(\bar{P}, P)=\psi(\bar{p}_{1}, \ldots, \bar{p}_{k})-\psi(p_{1}, \ldots, p_{k})-\left.\sum_{j} \frac{\partial}{\partial t_{j}} \psi(t_{1}, \ldots, t_{k})\right|_{t=\bar{p}}(\bar{p}_{j}-p_{j}).$$
-
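The expansion can be made concrete with a hypothetical worked example: for $\psi(P) = \mathbb{E}_P[Z]^2$ the influence function is $\varphi(z; P) = 2\,\mathbb{E}_P[Z]\,(z - \mathbb{E}_P[Z])$, and a direct calculation gives the second-order remainder $R_2(\bar{P}, P) = -(\bar{\mu} - \mu)^2$ with $\bar{\mu} = \mathbb{E}_{\bar{P}}[Z]$, $\mu = \mathbb{E}_P[Z]$.

```python
import numpy as np

# Two hypothetical discrete distributions on {0, 1, 2}
z = np.array([0.0, 1.0, 2.0])
p_bar = np.array([0.2, 0.5, 0.3])             # \bar{P}
p = np.array([0.3, 0.4, 0.3])                 # P

mu_bar, mu = p_bar @ z, p @ z
psi = lambda q: (q @ z) ** 2                  # psi(P) = E_P[Z]^2

phi_bar = 2 * mu_bar * (z - mu_bar)           # phi(z; P_bar)
linear = phi_bar @ (p_bar - p)                # \int phi(z; P_bar) d(P_bar - P)(z)
r2 = psi(p_bar) - psi(p) - linear             # remainder from the expansion

# R2 depends only on the square of a difference between P_bar and P
print(r2, -(mu_bar - mu) ** 2)
```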
That's all for today's update; off to the Blue Archive 3rd anniversary fes (
-
I love you 🤟 Bocchi-chan
-
@nomana Thanks~
-
Bocchi-chan, it's been a while since the last update
-
@nomana Research has been keeping me too busy lately; I'll update once the break starts
-
@hitori_bocchi So you're a Blue Archive player too!
-
@hashhash Arona is adorable!
-
@hashhash Feel free to join our school's Blue Archive group 155199376 to chat