[Reading Notes][Ongoing] Aad van der Vaart "Asymptotic Statistics"
-
Too busy today; I'm reading a paper my advisor assigned. I'll update on the weekend
-
Here to support quality writing ((
-
Ooh nice, I'll probably need to learn this later too
Following and supporting
-
About the next few updates: the content is a paper my advisor asked me to read. Although it is not from van der Vaart's book, it is also about semiparametric models (some of its content appears to come from Aad van der Vaart's early lecture notes), so I'll post it here as well.
Semiparametric Doubly Robust Targeted Double Machine Learning: A Review by Edward H. Kennedy (Associate Professor, CMU Dept. of Statistics & Data Science)
1. Introduction
"In this review we cover the basics of efficient nonparametric parameter estimation, with a focus on parameters that arise in causal inference problems. We review both efficiency bounds (i.e., what is the best possible performance for estimating a given parameter?) and the analysis of particular estimators (i.e., what is this estimator’s error, and does it attain the efficiency bound?) under weak assumptions."
Comment: Why a nonparametric model? For the ideas behind this, see a short essay by Mark van der Laan (distinguished professor in the UC Berkeley Statistics Department, COPSS Award winner, with major contributions to semiparametric theory, survival analysis, etc.): "Why We Need a Statistical Revolution" https://senseaboutscienceusa.org/super-learning-and-the-revolution-in-knowledge/
1.1 Set-up and notations
Suppose we observe a sample of independent observations $(Z_1, \cdots, Z_n)$ all identically distributed according to some unknown probability distribution $\mathbb{P}$, which is assumed to lie in some model (i.e., set of distributions) $\mathcal{P}$. Our goal is to estimate some structured combination of components, called a target parameter $\psi: \mathcal{P}\rightarrow \mathbb{R}^q$ (which is a functional).
At times we subscript expectations and other quantities with the distribution under which they are taken, i.e., $\mathbb{E}_{P}(Y \mid X=x)$ for an expectation under distribution $P$. When the distribution is clear from context, we sometimes omit subscripts; in general, quantities without subscripts are meant to be taken under some generic $P$ in the model, or else under the true distribution $\mathbb{P}$. We denote convergence in distribution by $\rightsquigarrow$ and convergence in probability by $\stackrel{p}{\rightarrow}$. We use standard big-oh and little-oh notation, i.e., $X_{n}=O_{\mathbb{P}}(r_{n})$ means $X_{n}/r_{n}$ is bounded in probability and $X_{n}=o_{\mathbb{P}}(r_{n})$ means $X_{n}/r_{n} \stackrel{p}{\rightarrow} 0$. To ease notation we sometimes omit arguments for functions of multiple arguments, e.g., $\varphi=\varphi(z; P)$ when the arguments are clear or secondary to the discussion. We use $\mathbb{P}_{n}$ to denote the empirical measure so that sample averages are written as $\mathbb{P}_{n}(f)=\mathbb{P}_{n}\{f(Z)\}=\frac{1}{n}\sum_{i} f(Z_{i})$. For a possibly random function $\widehat{f}$, we similarly write $\mathbb{P}(\widehat{f})=\mathbb{P}\{\widehat{f}(Z)\}=\int \widehat{f}(z)\, d\mathbb{P}(z)$, and we let $\|\widehat{f}\|^{2}=\int \widehat{f}(z)^{2}\, d\mathbb{P}(z)$ denote the squared $L_{2}(\mathbb{P})$ norm.
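To make the notation concrete, here is a minimal numeric illustration (a hypothetical example, not from the review): with $Z \sim N(0,1)$ and $f(z)=z^2$, the empirical measure $\mathbb{P}_n(f)$ is just a sample average, and it should be close to $\mathbb{P}(f)=\mathbb{E}[Z^2]=1$.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=10_000)   # a sample Z_1, ..., Z_n from P = N(0, 1)

f = lambda t: t ** 2
Pn_f = np.mean(f(z))          # P_n(f) = (1/n) * sum_i f(Z_i)
print(Pn_f)                   # should be close to P(f) = E[Z^2] = 1
```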
2. Benchmarks: Nonparametric Efficiency Bounds
“After having selected an appropriate target parameter $\psi$ matching the scientific question of interest, identifying (or bounding) it under appropriate causal (in causal inference, identification means expressing the target parameter as a function of the observed-data distribution under certain assumptions) or other assumptions, and laying out a statistical model $\mathcal{P}$ (which in our case will be nonparametric), a next line of business is to understand lower bounds or benchmarks for estimation error. In other words, how well can we possibly hope to estimate the parameter $\psi$ over the model $\mathcal{P}$?
There are two parts to showing optimality: (i) that no estimator can do better than some benchmark, and (ii) that a particular estimator does in fact attain that benchmark. Part (i) is discussed in this section, and part (ii) in the next section.”
A classic benchmarking or lower bound result for smooth parametric models is the Cramér–Rao bound. In its simplest form, this result states that for smooth parametric models $\mathcal{P}=\{P_{\theta}: \theta \in \mathbb{R}\}$ and smooth functionals (i.e., with $P_{\theta}$ and $\psi(\theta)$ differentiable in $\theta$), the variance of any unbiased estimator $\widehat{\psi}$ must satisfy
$$\operatorname{var}_{\theta}(\widehat{\psi}) \geq \frac{\psi^{\prime}(\theta)^{2}}{\operatorname{var}_{\theta}\{s_{\theta}(Z)\}}$$
where $s_{\theta}(z)=\frac{\partial}{\partial \theta} \log p_{\theta}(z)$ is the score function, i.e., no unbiased estimator can have smaller variance than the above ratio.
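As a sanity check of the bound on a hypothetical example (not from the review): for $Z \sim \mathrm{Bernoulli}(\theta)$ with $\psi(\theta)=\theta$, the score is $s_\theta(z)=(z-\theta)/\{\theta(1-\theta)\}$, so for $n$ i.i.d. observations the bound is $\theta(1-\theta)/n$, which the (unbiased) sample mean attains.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 200, 5_000

# Simulate the sample mean (an unbiased estimator of theta) many times
est = rng.binomial(1, theta, size=(reps, n)).mean(axis=1)
mc_var = est.var()                          # Monte Carlo variance of the estimator
cr_bound = theta * (1 - theta) / n          # Cramér–Rao bound for n i.i.d. draws
print(mc_var, cr_bound)                     # the two should nearly coincide
```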
A standard way to benchmark estimation error more generally is through minimax lower bounds of the form
$$\inf_{\widehat{\psi}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P}\left[\{\widehat{\psi}-\psi(P)\}^{2}\right] \geq R_{n}.$$
Comment: Here let me first introduce minimax theory, mainly following the lecture notes of Larry Wasserman (chaired professor in the CMU Statistics Department, COPSS Award winner, with major contributions to nonparametric statistics, Bayesian inference, etc.): https://www.stat.cmu.edu/~larry/=sml/Minimax.pdf
“When solving a statistical learning problem, there are often many procedures to choose from. This leads to the following question: how can we tell if one statistical learning procedure is better than another? One answer is provided by minimax theory which is a set of techniques for finding the minimum, worst case behavior of a procedure.”
Let $\mathcal{P}$ be a set of distributions and let $X_{1}, \ldots, X_{n}$ be a sample from some distribution $P \in \mathcal{P}$. Let $\theta(P)$ be some function of $P$. For example, $\theta(P)$ could be the mean of $P$, the variance of $P$ or the density of $P$. Let $\widehat{\theta}=\widehat{\theta}(X_{1}, \ldots, X_{n})$ denote an estimator. Given a metric $d$, the minimax risk is
$$R_{n} \equiv R_{n}(\mathcal{P})=\inf_{\widehat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P}[d(\widehat{\theta}, \theta(P))]$$
where the infimum is over all estimators.
Comment: How should we understand the minimax risk? Here I draw on a Zhihu answer: https://www.zhihu.com/question/347730562/answer/835333769
The "max" refers to $\sup_{P \in \mathcal{P}} \mathbb{E}_{P}[d(\widehat{\theta}, \theta(P))]$, the supremum over all possible "true" values $\theta(P)$: having fixed an estimator $\widehat{\theta}$, we let $\theta(P)$ roam freely over the statistical model ($\mathcal{P}$, which contains the true distribution, i.e., $\mathbb{P}\in \mathcal{P}$) to maximize the risk, which tells us the largest risk this chosen estimator $\widehat{\theta}$ can possibly incur.
The "min" is then the choice of estimator $\widehat{\theta}$ that minimizes these worst-case risks. So saying an estimator is minimax means that its worst-case performance $\sup_{P \in \mathcal{P}} \mathbb{E}_{P}[d(\widehat{\theta}, \theta(P))]$ (no matter how $\theta(P)$, $P \in \mathcal{P}$, is chosen) is the best possible.
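A classical textbook illustration of this min-over-max logic (a hypothetical example, not from the notes): for estimating a Bernoulli proportion $\theta$ from $n$ observations under squared error, the estimator $(S + \sqrt{n}/2)/(n + \sqrt{n})$, with $S$ the number of successes, has constant risk $n/\{4(n+\sqrt{n})^2\}$ and is minimax, while the sample mean's worst-case risk $1/(4n)$ (attained at $\theta = 1/2$) is strictly larger.

```python
import numpy as np

n = 100
theta = np.linspace(0.0, 1.0, 201)            # grid of possible "truths"

# Exact MSE of the sample mean for Bernoulli(theta): theta (1 - theta) / n
risk_mean = theta * (1 - theta) / n

# The estimator (S + sqrt(n)/2) / (n + sqrt(n)) has constant risk in theta
risk_mm = np.full_like(theta, n / (4 * (n + np.sqrt(n)) ** 2))

# Compare worst-case (sup over theta) risks: the constant-risk estimator wins
print(risk_mean.max(), risk_mm.max())
```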
-
More updates to come gradually (padding this comment to the minimum length)
-
I feel that keeping certain terms in the original language works well, mixing Chinese and English (whether it's actually English doesn't really matter). Even to a complete layman like me, "Borel" feels more natural than a phonetic transliteration (
-
@wumingshi OK, thanks for the suggestion
-
"Can the above Cramér–Rao bound (which works for smooth parametric models) be exploited to construct lower bound benchmarks for larger semi- or nonparametric models as well?"
2.1 Parametric Submodels
"The standard way to connect classic Cramer-Rao bounds for parametric models to larger more complicated nonparametric models is through a technical device called the parametric submodel"
Definition 1. A parametric submodel is a smooth parametric model $\mathcal{P}_{\epsilon}=\{P_{\epsilon}: \epsilon \in \mathbb{R}\}$ that satisfies (i) $\mathcal{P}_{\epsilon} \subseteq \mathcal{P}$, and (ii) $P_{\epsilon=0}=\mathbb{P}$.
Thus, in words, a parametric submodel is a parametric model that (i) is contained in the larger model $\mathcal{P}$ of interest, and (ii) equals the true distribution at $\epsilon=0$, i.e., contains the truth $\mathbb{P}$.
The high-level idea behind using submodels is that it is never harder to estimate a parameter over a smaller model, relative to a larger one in which the smaller model is contained. So any lower bound for a submodel will also be a valid lower bound for the larger model $\mathcal{P}$.
Comment: How should we understand "any lower bound for a submodel will also be a valid lower bound for the larger model $\mathcal{P}$"? Here is the explanation from van der Vaart's "Asymptotic Statistics", Chap. 25 Semiparametric Models.
"To estimate the parameter $\psi(P)$ given the model $\mathcal{P}$ is certainly harder than to estimate this parameter given that $P$ belongs to a submodel $\mathcal{P}_{0} \subset \mathcal{P}$. For every smooth parametric submodel $\mathcal{P}_{0}=\{P_{\theta}: \theta \in \Theta\} \subset \mathcal{P}$, we can calculate the Fisher information for estimating $\psi(P_{\theta})$. Then the information for estimating $\psi(P)$ in the whole model is certainly not bigger than the infimum of the informations over all submodels (recall that the Cramér–Rao bound states that the inverse of the Fisher information is a lower bound on the variance of any unbiased estimator of $\theta$). We shall simply define the information for the whole model as this infimum. A submodel for which the infimum is taken (if there is one) is called least favorable or a "hardest" submodel."
Comment: (the following is adapted from Mark van der Laan's course materials for STAT C245B Survival Analysis and Causality) The benchmark/lower bound in a minimax theory sense for the target parameter $\psi$ is tightly connected to looking at its derivative (the functional derivative). We are interested in the behavior of $\psi$ for local perturbations around $\mathbb{P}$. In particular, the derivative of $\psi$ and the steepness of this derivative define the difficulty of the estimation problem. We therefore need a theory of functional derivatives.
-
The material on directional derivatives & pathwise derivatives for functionals will come in the next update.
-
The theory behind semiparametrics really is a bit involved; it touches on some functional analysis. I don't plan to get too theoretical, and will focus more on intuition.
-
I've decided not to post updates on directional derivatives & pathwise derivatives for functionals, or on more semiparametric theory; there is just too much, enough to fill a book. This article was meant as a practically oriented guide anyway, and I don't want to stray too far from that...
-
It turns out that, for the purposes of constructing lower bound benchmarks for functional estimation, it often suffices to use one-dimensional parametric submodels. A common choice of submodel for nonparametric $\mathcal{P}$ is, for some mean-zero function $h: \mathcal{Z} \rightarrow \mathbb{R}$,
$$p_{\epsilon}(z)=d\mathbb{P}(z)\{1+\epsilon h(z)\}$$
where $\|h\|_{\infty} \leq M<\infty$ and $\epsilon<1/M$ so that $p_{\epsilon}(z) \geq 0$. Note for this submodel the score function is $\left.\frac{\partial}{\partial \epsilon} \log p_{\epsilon}(z)\right|_{\epsilon=0}=\left.\frac{\partial}{\partial \epsilon} \log\{1+\epsilon h(z)\}\right|_{\epsilon=0}=h(z)$. Therefore the Cramér–Rao lower bound for some $P_{\epsilon}$ in the example one-dimensional submodel $\mathcal{P}_{\epsilon}$ above is given by
$$\frac{\psi^{\prime}(P_{\epsilon})^{2}}{\operatorname{var}_{P_{\epsilon}}\{s_{\epsilon}(Z)\}}=\frac{\left\{\left.\frac{\partial}{\partial \epsilon} \psi(P_{\epsilon})\right|_{\epsilon=0}\right\}^{2}}{\mathbb{E}_{P_{\epsilon}}\{h(Z)^{2}\}}.$$
Comment: Why a one-dimensional submodel? For a detailed explanation see Michael Kosorok, "Introduction to Empirical Processes and Semiparametric Inference", Chap. 18.
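A minimal sketch of this bound on a hypothetical discrete example: for $\psi(P) = \mathbb{E}_P[Z]$ we have $\psi(P_\epsilon) = \mathbb{E}_P[Z] + \epsilon\, \mathbb{E}_P[Z h(Z)]$, so the bound above becomes $\mathbb{E}_P[Z h(Z)]^2 / \mathbb{E}_P[h(Z)^2]$ for a mean-zero direction $h$. Taking $h(z) = z - \mathbb{E}_P[Z]$ attains $\operatorname{var}_P(Z)$, and by Cauchy–Schwarz no other mean-zero direction does better.

```python
import numpy as np

# Hypothetical discrete distribution P on {0, 1, 2}
z = np.array([0.0, 1.0, 2.0])
p = np.array([0.3, 0.4, 0.3])
mu = p @ z

def cr_bound(h):
    h = h - p @ h                     # center so that E_P[h(Z)] = 0
    # (d/d eps psi(P_eps)|_0)^2 / E_P[h(Z)^2] = E_P[Z h(Z)]^2 / E_P[h(Z)^2]
    return (p @ (z * h)) ** 2 / (p @ h ** 2)

var_z = p @ (z - mu) ** 2
print(cr_bound(z - mu), var_z)        # h(z) = z - mu attains var_P(Z)
print(cr_bound(z ** 2) <= var_z)      # any other direction does no better
```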
One more point to explain is why we chose $p_{\epsilon}(z)=d\mathbb{P}(z)\{1+\epsilon h(z)\}$ as the submodel (the following is adapted from Mark van der Laan's course materials for STAT C245B Survival Analysis and Causality).
We want to define a type of differentiability of $\psi: \mathcal{P} \rightarrow \mathbb{R}^{q}$, where $\psi$ is the target parameter.
We could use the definition of a directional derivative in direction $h$:
$$d\psi(\mathbb{P})(h)=\left.\frac{d}{d\epsilon}\psi(\mathbb{P}+\epsilon h)\right|_{\epsilon=0}$$
However, $\mathbb{P}+\epsilon h$ might not be a path through $\mathcal{P}$, and thus may be ill defined. We need to define a derivative along paths that are submodels of $\mathcal{P}$.
Let $\mathcal{P}$ be nonparametric. We define a class of paths such that:
$$p_{\epsilon}(z)=d\mathbb{P}(z)\{1+\epsilon h(z)\}$$
Two key assumptions necessary for it to be a proper submodel are as follows:
- $h$ is uniformly bounded
- $\mathbb{E}_{P} h(z)=0$
For $\epsilon \in(-\delta, \delta)$ with $\delta=\frac{1}{\|h\|_{\infty}}$, this is a submodel.
To see why, first note that for the paths to be a proper density, we need:
- $d\mathbb{P}(z)\{1+\epsilon h(z)\} \geqslant 0$
Sketch proof:
Suppose $h$ is uniformly bounded, and consider the worst case $|h(z)|=\|h\|_{\infty}$. If $|\epsilon| \leqslant \delta$, then $1+\epsilon h(z) \geqslant 1-|\epsilon|\|h\|_{\infty} \geqslant 0$. Therefore, for $\epsilon$ sufficiently small and $h$ uniformly bounded, $d\mathbb{P}(z)\{1+\epsilon h(z)\} \geqslant 0$.
- $\int\{1+\epsilon h(z)\}\, d\mathbb{P}(z)=1$
Sketch proof:
Note that $\int\{1+\epsilon h(z)\}\, d\mathbb{P}(z)=\int d\mathbb{P}(z)+\epsilon \int h(z)\, d\mathbb{P}(z)=1$ since $p$ is a proper density and $\int h(z)\, d\mathbb{P}(z)=\mathbb{E}_{P} h(z)=0$ by assumption.
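The two sketch proofs can be checked numerically on a hypothetical discrete sample space $\{0,1,2\}$: along the whole range $|\epsilon| \leq 1/\|h\|_{\infty}$, the perturbed $p_\epsilon$ stays nonnegative and still sums to one.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])          # a density for P on {0, 1, 2}
h = np.array([1.0, -1.0, 1.0])
h = h - p @ h                          # center so that E_P[h(Z)] = 0

delta = 1 / np.max(np.abs(h))          # delta = 1 / ||h||_inf
for eps in np.linspace(-delta, delta, 41):
    p_eps = p * (1 + eps * h)          # the path p_eps(z) = p(z){1 + eps h(z)}
    assert np.all(p_eps >= 0)          # nonnegative on the whole range
    assert np.isclose(p_eps.sum(), 1)  # still sums (integrates) to one
print("p_eps is a valid density for |eps| <= 1/||h||_inf")
```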
Now consider the score of this submodel.
$$\begin{aligned} \left.\frac{\partial}{\partial \epsilon} \log \frac{dP_{\epsilon}}{d\mathbb{P}}\right|_{\epsilon=0} & =\left.\frac{\partial}{\partial \epsilon} \log\{1+\epsilon h(z)\}\right|_{\epsilon=0} \\ & =\left.\frac{h(z)}{1+\epsilon h(z)}\right|_{\epsilon=0} \\ & =h(z). \end{aligned}$$
"Since any lower bound for the submodel $\mathcal{P}_{\epsilon}$ is also a lower bound for $\mathcal{P}$, the best and most informative is the greatest such lower bound. Can we say anything about the best such lower bound for generic functionals and/or submodels?"
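The score computation above can be verified numerically (same hypothetical discrete setup): a symmetric finite difference of $\log p_\epsilon$ at $\epsilon = 0$ recovers $h(z)$.

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])          # hypothetical density for P
h = np.array([0.5, -0.4, 0.2])
h = h - p @ h                          # center so that E_P[h(Z)] = 0

# Score at eps = 0 via a symmetric finite difference in eps
d = 1e-6
score = (np.log(p * (1 + d * h)) - np.log(p * (1 - d * h))) / (2 * d)
print(np.max(np.abs(score - h)))       # tiny: the score equals h(z)
```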
2.2 Pathwise Differentiability
Recall the Cramér–Rao bound
$$\frac{\left\{\left.\frac{\partial}{\partial \epsilon} \psi(P_{\epsilon})\right|_{\epsilon=0}\right\}^{2}}{\mathbb{E}_{P_{\epsilon}}\{s_{\epsilon}(Z)^{2}\}}$$
for the submodel $\mathcal{P}_{\epsilon}$ described in the previous subsection. To find the best such lower bound, we would like to optimize the above over all $P_{\epsilon}$ in some submodels. It is not a priori clear how generally this can be accomplished, since different functionals $\psi$ could yield very different numerators. Therefore let us first consider what we can say about the derivative in the numerator, for a large class of pathwise differentiable functionals.
Namely, suppose the functional $\psi: \mathcal{P} \mapsto \mathbb{R}$ is smooth, as a map from distributions to the reals, in the sense that it admits a kind of distributional Taylor expansion
$$\psi(\bar{P})-\psi(P)=\int \varphi(z; \bar{P})\, d(\bar{P}-P)(z)+R_{2}(\bar{P}, P)$$
for distributions $\bar{P}$ and $P$, often called a von Mises expansion, where $\varphi(z; P)$ is a mean-zero, finite-variance function satisfying $\int \varphi(z; P)\, dP(z)=0$ and $\int \varphi(z; P)^{2}\, dP(z)<\infty$, and $R_{2}(\bar{P}, P)$ is a second-order remainder term (which means it only depends on products or squares of differences between $\bar{P}$ and $P$).
Intuitively, the von Mises expansion above is just an infinite-dimensional or distributional analog of a Taylor expansion, with $\varphi(z; Q)$ acting as a usual derivative term; it describes how the functional $\psi$ changes locally when the distribution changes from $P$ to $\bar{P}$. For example, when $Z \in\{1, \ldots, k\}$ is discrete and so $\bar{P}$ and $P$ have $k$ countable components, the von Mises expansion reduces to a standard multivariate Taylor expansion with
$$R_{2}(\bar{P}, P)=\psi(\bar{p}_{1}, \ldots, \bar{p}_{k})-\psi(p_{1}, \ldots, p_{k})-\left.\sum_{j} \frac{\partial}{\partial t_{j}} \psi(t_{1}, \ldots, t_{k})\right|_{t=\bar{p}}(\bar{p}_{j}-p_{j}).$$
-
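The expansion can be made concrete with a hypothetical worked example: for $\psi(P) = \mathbb{E}_P[Z]^2$ the influence function is $\varphi(z; P) = 2\,\mathbb{E}_P[Z]\,(z - \mathbb{E}_P[Z])$, and a direct calculation gives the second-order remainder $R_2(\bar{P}, P) = -(\bar{\mu} - \mu)^2$ with $\bar{\mu} = \mathbb{E}_{\bar{P}}[Z]$, $\mu = \mathbb{E}_P[Z]$.

```python
import numpy as np

# Two hypothetical discrete distributions on {0, 1, 2}
z = np.array([0.0, 1.0, 2.0])
p_bar = np.array([0.2, 0.5, 0.3])             # \bar{P}
p = np.array([0.3, 0.4, 0.3])                 # P

mu_bar, mu = p_bar @ z, p @ z
psi = lambda q: (q @ z) ** 2                  # psi(P) = E_P[Z]^2

phi_bar = 2 * mu_bar * (z - mu_bar)           # phi(z; P_bar)
linear = phi_bar @ (p_bar - p)                # \int phi(z; P_bar) d(P_bar - P)(z)
r2 = psi(p_bar) - psi(p) - linear             # remainder from the expansion

# R2 depends only on the square of a difference between P_bar and P
print(r2, -(mu_bar - mu) ** 2)
```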
That's all for today's update; off to the Blue Archive 3rd anniversary fes (
-
I love you 🤟 Bocchi-chan
-
@nomana Thanks~
-
Bocchi-chan, it's been a while since the last update
-
@nomana Research has been keeping me too busy lately; I'll update once the break starts
-
@hitori_bocchi So you're a Blue Archive player too!
-
@hashhash Arona is adorable!
-
@hashhash Feel free to join our school's Blue Archive group 155199376 to chat