[Reading Notes][Ongoing] Aad van der Vaart, "Asymptotic Statistics"
-
It has been a very long time since I last posted here. I recently landed a research internship, and the project my advisor is currently pushing forward involves semi-parametric inference, so I want to teach myself the relevant material first.
These notes are expected to cover the main parts of Chapters 18-20 and 25 of Aad van der Vaart's "Asymptotic Statistics". Quoted passages come from the original text; within them, text in bold is my own annotation.
Chap. 18 Stochastic Convergence in Metric Spaces
“This chapter extends the concepts of convergence in distribution, in probability, and almost surely from Euclidean spaces (object of interest: random vectors in Euclidean spaces) to more abstract metric spaces. We are particularly interested in developing the theory for random functions, or stochastic processes, viewed as elements of the metric space of all bounded functions. (object of interest: random functions in the metric space of all bounded functions) ”
Comment: this chapter will serve as a foundation for empirical process theory and semi-parametric inference.
18.1 Metric and Normed Spaces
Comment: This subsection mainly collects some basic concepts and theorems from point-set topology and measure theory; there is not much to say about it. I personally am not a fan of van der Vaart's treatment; better introductions are Walter Rudin's "Principles of Mathematical Analysis" and Halsey Royden & Patrick Fitzpatrick's "Real Analysis". Here I will mainly list the more important points.
Rudin 4.8 Theorem. A mapping $f$ of a metric space $X$ into a metric space $Y$ is continuous on $X$ if and only if $f^{-1}(V)$ is open in $X$ for every open set $V$ in $Y$.
Proof. Suppose $f$ is continuous on $X$ and $V$ is an open set in $Y$. We have to show that every point of $f^{-1}(V)$ is an interior point of $f^{-1}(V)$. So, suppose $p \in X$ and $f(p) \in V$. Since $V$ is open, there exists $\varepsilon>0$ such that $y \in V$ if $d_Y(f(p), y)<\varepsilon$; and since $f$ is continuous at $p$, there exists $\delta>0$ such that $d_Y(f(x), f(p))<\varepsilon$ if $d_X(x, p)<\delta$. Thus $x \in f^{-1}(V)$ as soon as $d_X(x, p)<\delta$.
Conversely, suppose $f^{-1}(V)$ is open in $X$ for every open set $V$ in $Y$. Fix $p \in X$ and $\varepsilon>0$, and let $V$ be the set of all $y \in Y$ such that $d_Y(y, f(p))<\varepsilon$. Then $V$ is open; hence $f^{-1}(V)$ is open; hence there exists $\delta>0$ such that $x \in f^{-1}(V)$ as soon as $d_X(p, x)<\delta$. But if $x \in f^{-1}(V)$, then $f(x) \in V$, so that $d_Y(f(x), f(p))<\varepsilon$. $\blacksquare$
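A concrete instance (my own toy example, not from Rudin): take $f(x)=x^2$ on $\mathbb{R}$ and the open set $V=(1,4)$; the preimage $f^{-1}(V)=(-2,-1)\cup(1,2)$ is open, as the theorem requires. A minimal Python check of a few points:
```python
# My own toy check of Rudin 4.8 (not from the book): f(x) = x^2 is continuous,
# and the preimage of the open set V = (1, 4) is the open set (-2, -1) ∪ (1, 2).
f = lambda x: x ** 2
in_preimage = lambda x: 1 < f(x) < 4
print(in_preimage(1.5), in_preimage(-1.5), in_preimage(0.5), in_preimage(2.0))
# True True False False
```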
Just a short introduction today; I will keep updating gradually. Tomorrow I have to get up early for a vaccination and for class. So tired.
-
This reply has been deleted! -
Looks like I forgot to log in... My ID is @HITORI_BOCCHI
-
Bocchi-chan, long time no see. I've been waiting forever for you to film season two.
-
@wumingshi Long time no see
-
Comment: Rudin 4.8 Theorem will serve as a lemma for an important result later on. The key question of this subsection is: why are random elements (recall the preface of Chap. 18) defined as Borel-measurable maps? Why do we care about Borel sets and Borel-measurable functions? Bocchi-chan's take is that they have properties superior to those of other sets and functions (this will be revealed in 18.2 Lemma), which make some important notions we will meet later (such as expectations and probabilities) well-defined. Before proving 18.2 Lemma, we first give some key definitions.
18.1 Definition. The Borel $\sigma$-field on a metric space $\mathbb{D}$ is the smallest $\sigma$-field that contains the open sets (and then also the closed sets). A function defined relative to (one or two) metric spaces is called Borel-measurable if it is measurable relative to the Borel $\sigma$-field(s). A Borel-measurable map $X: \Omega \mapsto \mathbb{D}$ defined on a probability space $(\Omega, \mathcal{U}, \mathrm{P})$ is referred to as a random element with values in $\mathbb{D}$.
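To make the definition a little more concrete, here is a minimal Python sketch (my own illustration, not from the book): one realization of the empirical distribution function of a uniform sample, viewed as a random element of the space of bounded functions on $[0,1]$ once it is evaluated on a grid.
```python
import numpy as np

# My own toy illustration (not from van der Vaart): the empirical distribution
# function F_n of a uniform sample, viewed as a random element of the space of
# bounded functions on [0,1] (discretized here on a grid of t-values).
rng = np.random.default_rng(0)
Z = rng.uniform(size=50)                        # one draw omega -> (Z_1, ..., Z_n)
t = np.linspace(0.0, 1.0, 201)
F_n = (Z[:, None] <= t[None, :]).mean(axis=0)   # one realization of the random function t -> F_n(t)
print(F_n[:5])
```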
Next we prove the most important lemma of this subsection, which reveals the exceptionally nice properties of Borel measurability.
18.2 Lemma. A continuous map between metric spaces is Borel-measurable.
Proof. A map $g: \mathbb{D} \mapsto \mathbb{E}$ is continuous if and only if the inverse image $g^{-1}(G)$ of every open set $G \subset \mathbb{E}$ is open in $\mathbb{D}$ (recall Rudin 4.8 Theorem). In particular, for every open $G$ the set $g^{-1}(G)$ is a Borel set in $\mathbb{D}$. By definition, the open sets in $\mathbb{E}$ generate the Borel $\sigma$-field. Thus, the inverse image of a generator of the Borel sets in $\mathbb{E}$ is contained in the Borel $\sigma$-field in $\mathbb{D}$. Because the inverse image $g^{-1}(\mathcal{G})$ of a generator $\mathcal{G}$ of a $\sigma$-field $\mathcal{B}$ generates the $\sigma$-field $g^{-1}(\mathcal{B})$, it follows that the inverse image of every Borel set is a Borel set. $\blacksquare$
Why don't other classes of measurable sets have such nice properties? Let us look at an example involving Lebesgue-measurable sets.
Royden & Fitzpatrick Proposition 2.21. Let $\varphi$ be the Cantor-Lebesgue function and define the function $\psi$ on $[0,1]$ by
$$\psi(x)=\varphi(x)+x \quad \text{for all } x \in[0,1].$$
Then $\psi$ is a strictly increasing continuous function that maps $[0,1]$ onto $[0,2]$, and maps a measurable set, a subset of the Cantor set, onto a nonmeasurable set.
Proof. The function $\psi$ is continuous since it is the sum of two continuous functions, and strictly increasing since it is the sum of an increasing and a strictly increasing function. Moreover, since $\psi(0)=0$ and $\psi(1)=2$, $\psi([0,1])=[0,2]$. For $\mathcal{O}=[0,1] \sim \mathbf{C}$, we have the disjoint decomposition
$$[0,1]=\mathbf{C} \cup \mathcal{O},$$
which $\psi$ lifts to the disjoint decomposition
$$[0,2]=\psi(\mathcal{O}) \cup \psi(\mathbf{C}).$$
We note that Vitali's Theorem tells us that $\psi(\mathbf{C})$ contains a set $W$ which is nonmeasurable. The set $\psi^{-1}(W)$ is measurable and has measure zero since it is a subset of the Cantor set. The set $\psi^{-1}(W)$ is a measurable subset of the Cantor set which is mapped by $\psi$ onto a nonmeasurable set. $\blacksquare$
Thus, even though $\psi$ is continuous, it may still map a Lebesgue-measurable set onto a nonmeasurable set, which would be fatal when we later define expectations and probabilities (recall our definition of random elements).
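As a side illustration (my own sketch, not from Royden & Fitzpatrick), the Cantor-Lebesgue function $\varphi$ can be approximated numerically from the ternary expansion, and $\psi(x)=\varphi(x)+x$ can then be checked on a grid to be increasing and to map $[0,1]$ onto $[0,2]$; the nonmeasurability part of the proposition of course cannot be seen numerically.
```python
def cantor_lebesgue(x, depth=40):
    """Approximate the Cantor-Lebesgue function phi(x) via the ternary expansion of x in [0, 1]."""
    if x >= 1.0:
        return 1.0
    total, scale = 0.0, 0.5
    for _ in range(depth):
        x *= 3.0
        digit = int(x)
        x -= digit
        if digit == 1:                  # first ternary digit equal to 1: phi is constant from here on
            return total + scale
        total += scale * (digit // 2)   # digits 0 and 2 contribute binary digits 0 and 1
        scale /= 2.0
    return total

def psi(x):
    # psi(x) = phi(x) + x, as in Royden & Fitzpatrick Proposition 2.21
    return cantor_lebesgue(x) + x

xs = [i / 1000 for i in range(1001)]
vals = [psi(x) for x in xs]
print(vals[0], vals[-1])                            # 0.0 and 2.0
print(all(a < b for a, b in zip(vals, vals[1:])))   # strictly increasing on the grid
```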
-
@hitori_bocchi Will update 18.2 tomorrow
-
Did you do your undergrad in the US? Why translate the terms so oddly?
-
Nice notes, followed.
-
I count as about 1/4 of a US undergrad (padding to meet the eight-character minimum)
-
Thanks for the support (padding to meet the eight-character minimum)
-
Too busy today, reading the literature my advisor assigned; will update on the weekend.
-
Supporting quality content ((
-
Ooh nice, I will probably need to learn this later too.
Followed, supporting -
About the next few updates: the content comes from a paper my advisor asked me to read. It is not from van der Vaart's book, but it is also about semi-parametric models (some of its content appears to come from Aad van der Vaart's earlier lecture notes), so I will post it here as well.
Semiparametric Doubly Robust Targeted Double Machine Learning: A Review by Edward H. Kennedy (Associate Professor, CMU Dept. of Statistics & Data Science)
1. Introduction
"In this review we cover the basics of efficient nonparametric parameter estimation, with a focus on parameters that arise in causal inference problems. We review both efficiency bounds (i.e., what is the best possible performance for estimating a given parameter?) and the analysis of particular estimators (i.e., what is this estimator’s error, and does it attain the efficiency bound?) under weak assumptions."
Comment: Why a nonparametric model? For the thinking behind this, see a short essay by Mark van der Laan (a distinguished professor of statistics at UC Berkeley, COPSS Award winner, with important contributions to semiparametric theory, survival analysis, and related areas): "Why We Need a Statistical Revolution" https://senseaboutscienceusa.org/super-learning-and-the-revolution-in-knowledge/
1.1 Set-up and notations
Suppose we observe a sample of independent observations $(Z_1, \cdots, Z_n)$, all identically distributed according to some unknown probability distribution $\mathbb{P}$, which is assumed to lie in some model (i.e., set of distributions) $\mathcal{P}$. Our goal is to estimate some structured combination of components, called a target parameter $\psi: \mathcal{P}\rightarrow \mathbb{R}^q$ (which is a functional).
At times we subscript expectations and other quantities with the distribution under which they are taken, i.e., $\mathbb{E}_{P}(Y \mid X=x)$ for an expectation under distribution $P$. When the distribution is clear from context, we sometimes omit subscripts; in general, quantities without subscripts are meant to be taken under some generic $P$ in the model, or else under the true distribution $\mathbb{P}$. We denote convergence in distribution by $\rightsquigarrow$ and convergence in probability by $\stackrel{p}{\rightarrow}$. We use standard big-oh and little-oh notation, i.e., $X_{n}=O_{\mathbb{P}}(r_{n})$ means $X_{n}/r_{n}$ is bounded in probability and $X_{n}=o_{\mathbb{P}}(r_{n})$ means $X_{n}/r_{n} \stackrel{p}{\rightarrow} 0$. To ease notation we sometimes omit arguments for functions of multiple arguments, e.g., $\varphi=\varphi(z; P)$ when the arguments are clear or secondary to the discussion. We use $\mathbb{P}_{n}$ to denote the empirical measure, so that sample averages are written as $\mathbb{P}_{n}(f)=\mathbb{P}_{n}\{f(Z)\}=\frac{1}{n}\sum_{i} f(Z_{i})$. For a possibly random function $\widehat{f}$, we similarly write $\mathbb{P}(\widehat{f})=\mathbb{P}\{\widehat{f}(Z)\}=\int \widehat{f}(z)\, d\mathbb{P}(z)$, and we let $\|\widehat{f}\|^{2}=\int \widehat{f}(z)^{2}\, d\mathbb{P}(z)$ denote the squared $L_{2}(\mathbb{P})$ norm.
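A minimal Python sketch of the empirical-measure notation (my own illustration; the distribution and the function $f$ are arbitrary choices, not from the paper): $\mathbb{P}_n(f)$ averages $f$ over the observed sample, while $\mathbb{P}(\widehat f)$ integrates over a fresh draw from $\mathbb{P}$ with $\widehat f$ held fixed.
```python
import numpy as np

# My own toy illustration of the notation: P_n(f) averages f over the observed
# sample, while P(f) integrates against the true distribution with f held fixed
# (approximated here by a large fresh Monte Carlo sample).
rng = np.random.default_rng(0)
Z = rng.normal(size=500)                    # observed sample from P = N(0, 1)
f = lambda z: z ** 2
Pn_f = np.mean(f(Z))                        # P_n{f(Z)} = (1/n) sum_i f(Z_i)
Z_fresh = rng.normal(size=10**6)            # Monte Carlo stand-in for integrating against P
P_f = np.mean(f(Z_fresh))                   # approximates P{f(Z)} = E f(Z) = 1
l2_norm_sq = np.mean(f(Z_fresh) ** 2)       # approximates ||f||^2 = int f(z)^2 dP(z) = 3
print(Pn_f, P_f, l2_norm_sq)
```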
2. Benchmarks: Nonparametric Efficiency Bounds
“After having selected an appropriate target parameter $\psi$ matching the scientific question of interest, identifying (or bounding) it under appropriate causal (in causal inference, identification means expressing the target parameter as a functional of the observed data under certain assumptions) or other assumptions, and laying out a statistical model $\mathcal{P}$ (which in our case will be nonparametric), a next line of business is to understand lower bounds or benchmarks for estimation error. In other words, how well can we possibly hope to estimate the parameter $\psi$ over the model $\mathcal{P}$?
There are two parts to showing optimality: (i) that no estimator can do better than some benchmark, and (ii) that a particular estimator does in fact attain that benchmark. Part (i) is discussed in this section, and part (ii) in the next section.
”
A classic benchmarking or lower bound result for smooth parametric models is the Cramer-Rao bound. In its simplest form, this result states that for smooth parametric models $\mathcal{P}=\{P_{\theta}: \theta \in \mathbb{R}\}$ and smooth functionals (i.e., with $P_{\theta}$ and $\psi(\theta)$ differentiable in $\theta$), the variance of any unbiased estimator $\widehat{\psi}$ must satisfy
$$\operatorname{var}_{\theta}(\widehat{\psi}) \geq \frac{\psi^{\prime}(\theta)^{2}}{\operatorname{var}_{\theta}\{s_{\theta}(Z)\}},$$
where $s_{\theta}(z)=\frac{\partial}{\partial \theta} \log p_{\theta}(z)$ is the score function, i.e., no unbiased estimator can have smaller variance than the above ratio.
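As a quick sanity check (my own sketch; the normal-mean model and the sample-mean estimator are my choices, not the paper's): in the model $P_\theta = N(\theta, 1)$ the score is $s_\theta(z) = z - \theta$ with variance 1, so for $\psi(\theta)=\theta$ and $n$ i.i.d. observations the bound is $1/n$, which the (unbiased) sample mean attains.
```python
import numpy as np

# My own Monte Carlo check of the Cramér-Rao bound (not from the paper).
# Model: P_theta = N(theta, 1), score s_theta(z) = z - theta, var{s_theta(Z)} = 1.
# For psi(theta) = theta and n i.i.d. observations the bound is 1/n,
# and the unbiased sample mean attains it.
rng = np.random.default_rng(0)
theta, n, reps = 2.0, 50, 20000
estimates = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
print(estimates.var(), 1.0 / n)   # empirical variance vs. Cramér-Rao lower bound
```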
A standard way to benchmark estimation error more generally is through minimax lower bounds of the form
$$\inf_{\widehat{\psi}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P}\left[\{\widehat{\psi}-\psi(P)\}^{2}\right] \geq R_{n}.$$
Comment: Let me first introduce some minimax theory here, mainly following the lecture notes of Larry Wasserman (a chaired professor in CMU's Department of Statistics, COPSS Award winner, with important contributions to nonparametric statistics, Bayesian statistics, and other areas): https://www.stat.cmu.edu/~larry/=sml/Minimax.pdf
“When solving a statistical learning problem, there are often many procedures to choose from. This leads to the following question: how can we tell if one statistical learning procedure is better than another? One answer is provided by minimax theory which is a set of techniques for finding the minimum, worst case behavior of a procedure.”
Let $\mathcal{P}$ be a set of distributions and let $X_{1}, \ldots, X_{n}$ be a sample from some distribution $P \in \mathcal{P}$. Let $\theta(P)$ be some function of $P$. For example, $\theta(P)$ could be the mean of $P$, the variance of $P$ or the density of $P$. Let $\widehat{\theta}=\widehat{\theta}(X_{1}, \ldots, X_{n})$ denote an estimator. Given a metric $d$, the minimax risk is
$$R_{n} \equiv R_{n}(\mathcal{P})=\inf_{\widehat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P}[d(\widehat{\theta}, \theta(P))],$$
where the infimum is over all estimators.
Comment: How should one understand the minimax risk? Here I draw on a Zhihu answer: https://www.zhihu.com/question/347730562/answer/835333769
The "max" refers to $\sup_{P \in \mathcal{P}} \mathbb{E}_{P}[d(\widehat{\theta}, \theta(P))]$, the supremum over all possible "true" values of $\theta(P)$: after fixing an estimator $\hat{\theta}$, we let $\theta(P)$ roam freely over the statistical model ($\mathcal{P}$, which contains the true distribution, i.e., $\mathbb{P}\in \mathcal{P}$) so as to maximize the risk, thereby determining the largest risk this chosen estimator $\hat{\theta}$ could possibly face.
The "min" is then choosing the estimator $\hat{\theta}$ to obtain the infimum of these worst-case risks. So an estimator being minimax means that its worst-case performance $\sup_{P \in \mathcal{P}} \mathbb{E}_{P}[d(\widehat{\theta}, \theta(P))]$ (no matter how $\theta(P)$, $P \in \mathcal{P}$ is chosen) is the best possible.
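To make the inner sup concrete, here is a minimal Monte Carlo sketch (my own illustration; the normal-mean model, the grid of $\theta$ values, and squared-error $d$ are my choices, not Wasserman's): for a fixed estimator, the worst-case risk is approximated by maximizing the risk over a grid of candidate truths. For the sample mean in the $N(\theta,1)$ model the risk is constant in $\theta$, so its worst case is (approximately) $1/n$.
```python
import numpy as np

# My own illustration of the inner sup in the minimax risk.
# Model: N(theta, 1) with theta on a grid; estimator: sample mean; loss d = squared error.
rng = np.random.default_rng(0)
n, reps = 50, 20000
thetas = np.linspace(-3.0, 3.0, 13)          # crude stand-in for "theta(P) roaming over the model"
worst = 0.0
for theta in thetas:
    est = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
    risk = np.mean((est - theta) ** 2)       # E_P[(theta_hat - theta)^2] by Monte Carlo
    worst = max(worst, risk)
print(worst, 1.0 / n)   # worst-case risk of the sample mean is (approximately) 1/n
```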
-
More updates to come, slowly (padding to meet the eight-character minimum)
-
I feel like keeping some terms in the original and mixing Chinese and English (whether it's "real" English or not doesn't matter) works just fine. Even as a complete layman, "Borel" feels more natural to me than the transliteration (
-
@wumingshi OK, thanks for the suggestion
-
"Can the above Cramer-Rao bounds (work for smooth parametric models) be exploited to construct lower bound benchmarks for larger semi- or nonparametric models as well?"
2.1 Parametric Submodels
"The standard way to connect classic Cramer-Rao bounds for parametric models to larger more complicated nonparametric models is through a technical device called the parametric submodel"
Definition 1. A parametric submodel is a smooth parametric model $\mathcal{P}_{\epsilon}=\{P_{\epsilon}: \epsilon \in \mathbb{R}\}$ that satisfies (i) $\mathcal{P}_{\epsilon} \subseteq \mathcal{P}$, and (ii) $P_{\epsilon=0}=\mathbb{P}$.
Thus, in words, a parametric submodel is a parametric model that (i) is contained in the larger model $\mathcal{P}$ of interest, and (ii) equals the true distribution at $\epsilon=0$, i.e., contains the truth $\mathbb{P}$.
The high-level idea behind using submodels is that it is never harder to estimate a parameter over a smaller model, relative to a larger one in which the smaller model is contained. So any lower bound for a submodel will also be a valid lower bound for the larger model $\mathcal{P}$.
Comment: How should we understand "any lower bound for a submodel will also be a valid lower bound for the larger model $\mathcal{P}$"? Here I attach the explanation from van der Vaart, "Asymptotic Statistics", Chap. 25 Semiparametric Models.
"To estimate the parameter ψ(P)\psi(P)ψ(P) given the model P\mathcal{P}P is certainly harder than to estimate this parameter given that PPP belongs to a submodel P0⊂P\mathcal{P}_{0} \subset \mathcal{P}P0⊂P. For every smooth parametric submodel P0={Pθ:θ∈Θ}⊂P\mathcal{P}_{0}=\left\{P_{\theta}: \theta \in \Theta\right\} \subset \mathcal{P}P0={Pθ:θ∈Θ}⊂P, we can calculate the Fisher information for estimating ψ(Pθ)\psi\left(P_{\theta}\right)ψ(Pθ). Then the information for estimating ψ(P)\psi(P)ψ(P) in the whole model is certainly not bigger than the infimum of the informations over all submodels (recall that The Cramér–Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any unbiased estimator of θ\thetaθ). We shall simply define the information for the whole model as this infimum. A submodel for which the infimum is taken (if there is one) is called least favorable or a "hardest" submodel."
Comment: (The following is adapted from the course materials of Mark van der Laan's STAT C245B Survival Analysis and Causality.) The benchmark/lower bound in a minimax theory sense for the target parameter $\psi$ is tightly connected to looking at the derivative of it (the functional derivative). We are interested in the behavior of $\psi$ for local perturbations around $\mathbb{P}$. In particular, the derivative of $\psi$ and the steepness of this derivative define the difficulty of the estimation problem. Therefore, we need a theory of functional derivatives.
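As a small preview (my own sketch, not from Kennedy or van der Laan; the submodel and the functional are my choices): take the mean functional $\psi(P)=\mathbb{E}_P(Z)$ and the parametric submodel $P_\epsilon = N(\epsilon, 1)$ through the truth $\mathbb{P} = N(0,1)$. The score at $\epsilon=0$ is $s_0(z)=z$, and the derivative $\frac{d}{d\epsilon}\psi(P_\epsilon)\big|_{\epsilon=0}=1$ coincides with $\mathbb{E}\{Z\, s_0(Z)\}$, which a finite-difference Monte Carlo check confirms.
```python
import numpy as np

# My own preview of a derivative along a parametric submodel (not from the paper):
# submodel P_eps = N(eps, 1) through P = N(0, 1), functional psi(P) = E_P(Z).
# The exact derivative at eps = 0 is 1 and equals E{Z * s_0(Z)}, where s_0(z) = z.
rng = np.random.default_rng(0)
base = rng.normal(size=10**6)                      # draws from N(0, 1)

def psi(eps):
    # Monte Carlo approximation of psi(P_eps) = E_{N(eps,1)}(Z) = eps,
    # using common random numbers (a location shift of the same base draws)
    return (base + eps).mean()

eps = 1e-2
finite_diff = (psi(eps) - psi(-eps)) / (2 * eps)   # approximates d/d(eps) psi(P_eps) at eps = 0
score_form = np.mean(base * base)                  # E{Z * s_0(Z)} = E(Z^2) ≈ 1
print(finite_diff, score_form)
```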
-
Directional derivative & pathwise derivative for functional的东西下次再更新。