Introduction, Neural Network Construction
All modern machine learning algorithms are just nearest neighbors. It’s only that the neural networks are telling you the space in which to compute the distance.
Risk: expected loss
\[R(\theta) = \mathbb{E}\left[\mathcal{L}(\theta, X, Y)\right]\]Empirical Risk: average loss
\[\hat{R}(\theta) = \frac{1}{N}\sum_{i=1}^N\mathcal{L}(\theta, x_i, y_i)\]If the empirical risk is too high, the model is underfitting; if it is driven far below the true risk, the model is overfitting.
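As a quick sketch (the squared-error loss and the toy data below are my own illustrative choices, not fixed by the notes), the empirical risk is just the average per-example loss over the dataset:

```python
import numpy as np

def empirical_risk(theta, X, y, loss):
    """(1/N) * sum_i L(theta, x_i, y_i): average per-example loss."""
    return np.mean([loss(theta, x_i, y_i) for x_i, y_i in zip(X, y)])

# Squared-error loss for a linear model f(x) = theta @ x (illustrative choice).
squared_loss = lambda theta, x, y: 0.5 * (theta @ x - y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)

print(empirical_risk(theta_true, X, y, squared_loss))   # small: near the noise level
print(empirical_risk(np.zeros(3), X, y, squared_loss))  # much larger
```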
This is really just regularization viewed from a Bayesian perspective.
What we want to maximize is
\[p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\cdot p(\theta)}{p(\mathcal{D})}\]This is called MAP, i.e., Maximum A Posteriori, estimation.
Taking the logarithm, we get
\[\begin{aligned} \arg\max_{\theta}\log p(\theta \mid \mathcal{D}) &= \arg\max_{\theta}\left\{\log p(\mathcal{D} \mid \theta) + \log p(\theta) - \log p(\mathcal{D})\right\} \\ &= \arg\max_{\theta}\left\{\log p(\mathcal{D} \mid \theta) + \log p(\theta)\right\} \end{aligned}\]And \(\log p(\mathcal{D} \mid \theta)\) is just the (negative of the) loss, while \(\log p(\theta)\) is just the regularization term.
For example, if we assume \(\theta\sim\mathcal{N}(0, \sigma^2I)\), then
\[\log p(\theta) = n\log\left(\frac{1}{\sigma\sqrt{2\pi}}\right) - \frac{1}{2}\sum_{i=1}^n\left(\frac{\theta_i}{\sigma}\right)^2 = \text{const} - \lambda\Vert\theta\Vert^2\]where \(\lambda = \frac{1}{2\sigma^2}\) is a constant.
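A quick numerical check of this identity, assuming the Gaussian prior \(\theta\sim\mathcal{N}(0,\sigma^2 I)\) above (the particular numbers are arbitrary):

```python
import numpy as np
from scipy.stats import norm

sigma = 2.0
theta = np.array([0.3, -1.2, 0.7])
n = theta.size

# Log-density of N(0, sigma^2 I), computed directly.
log_prior = norm.logpdf(theta, loc=0.0, scale=sigma).sum()

# const - lambda * ||theta||^2 with lambda = 1 / (2 sigma^2).
const = n * np.log(1.0 / (sigma * np.sqrt(2.0 * np.pi)))
lam = 1.0 / (2.0 * sigma ** 2)

print(np.isclose(log_prior, const - lam * np.dot(theta, theta)))  # True
```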
Now consider linear regression with basis functions, \(f(x) = w^{\top}\phi(x)\), where the \(\phi_j\) are called basis functions. Usually \(\phi_0(x)=1\), which gives the bias term.
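For concreteness, here is a small sketch of how one might build the design matrix \(\Phi\) from polynomial basis functions (the polynomial choice and the function name are mine, not from the lecture); any basis with \(\phi_0(x)=1\) works the same way:

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Phi[i, j] = phi_j(x_i) = x_i**j, with phi_0(x) = 1 giving the bias column."""
    return np.vander(x, N=degree + 1, increasing=True)

x = np.linspace(0.0, 1.0, 5)
Phi = polynomial_design_matrix(x, degree=2)
print(Phi)  # columns: [1, x, x^2]
```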
We then assume that the observed targets \(y\) are generated as
\[y = \Phi w + \epsilon, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)\]Then, with the loss function
\[\mathcal{L}(w) = \frac{1}{2}\left(y - \Phi w\right)^{\top}\left(y - \Phi w\right) + \frac{\lambda}{2}w^{\top}w\]it is easy to obtain
\[\hat{w} = (\Phi^{\top}\Phi + \lambda I)^{-1}\Phi^{\top}y\]Here is a quick review of what we learned last semester. Suppose there is no regularization; then we have
\[\hat{w} = (\Phi^{\top}\Phi)^{-1}\Phi^{\top}y\]Now let
\[H = \Phi(\Phi^{\top}\Phi)^{-1}\Phi^{\top}\]Then \(H\) and \(I - H\) are projection matrices, and \(Hy \perp (I - H)y\):
\[\begin{cases} H\Phi = \Phi \\ H^2 = H \\ (I - H)^2 = I - H \\ H^{\top} = H \\ (I - H)^{\top} = I - H \end{cases}\]For the error analysis, the picture is that \(y\) is a point in the column space of \(\Phi\) plus Gaussian noise \(\epsilon\). The column space of \(\Phi\) is \(M\)-dimensional, so its orthogonal complement, onto which \(I - H\) projects, is \((N - M)\)-dimensional. What we need is the squared length of the projection of \(\epsilon\) onto that complement, which is a sum of the squares of \(N - M\) independent \(\mathcal{N}(0, \sigma^2)\) variables, i.e.
\[\frac{1}{\sigma^2}(y - \hat{y})^{\top}(y - \hat{y}) \sim \chi^2_{N-M}\]Therefore
\[\mathbb{E}\left[(y - \hat{y})^{\top}(y - \hat{y})\right] = \sigma^2(N - M)\]that is,
\[s^2 = \frac{1}{N-M}\sum_{i=1}^N(y_i - \hat{y}_i)^2\Longrightarrow\mathbb{E}[s^2] = \sigma^2\]Time to bring out the ANOVA table (with \(A = \frac{1}{N}\mathbf{1}\mathbf{1}^{\top}\)); a numerical sanity check follows the table below:
Source | DF | Sum of Squares | Mean Square | \(\mathcal{F}\)-value |
---|---|---|---|---|
Regression | \(M - 1\) | \(\mathrm{SSR}=\sum_{i=1}^N(\hat{y}_i - \overline{y})^2 = y^{\top}(H - A)y\) | \(\mathrm{MSR}=\frac{\mathrm{SSR}}{M-1}\) | \(F=\frac{\mathrm{MSR}}{s^2}\) |
Residual | \(N - M\) | \(\mathrm{SSE}=\sum_{i=1}^N(y_i - \hat{y}_i)^2 = y^{\top}(I - H)y\) | \(s^2=\frac{\mathrm{SSE}}{N-M}\) | |
Total | \(N - 1\) | \(\mathcal{S}_{yy}=\sum_{i=1}^N(y_i - \overline{y})^2 = y^{\top}(I - A)y\) | | |
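The sanity check promised above: a sketch with made-up data and my own variable names, computing the ridge and OLS estimates from their closed forms, verifying the projection properties of \(H\), and reproducing the ANOVA quantities:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4                                       # N samples, M basis functions (incl. the constant)
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, N=M, increasing=True)           # columns: 1, x, x^2, x^3
w_true = np.array([1.0, -2.0, 0.5, 3.0])
sigma = 0.3
y = Phi @ w_true + sigma * rng.normal(size=N)

# Closed-form estimates.
lam = 0.1
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ y)
w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

# Projection matrix H and its properties.
H = Phi @ np.linalg.solve(Phi.T @ Phi, Phi.T)
A = np.full((N, N), 1.0 / N)                       # A = (1/N) 1 1^T
print(np.allclose(H @ Phi, Phi), np.allclose(H @ H, H), np.allclose(H.T, H))

# ANOVA quantities.
y_hat = H @ y
SSE = y @ (np.eye(N) - H) @ y                      # = sum (y_i - y_hat_i)^2
SSR = y @ (H - A) @ y                              # = sum (y_hat_i - y_bar)^2
Syy = y @ (np.eye(N) - A) @ y                      # = sum (y_i - y_bar)^2
print(np.isclose(SSE + SSR, Syy))                  # decomposition holds

s2 = SSE / (N - M)                                 # unbiased estimate of sigma^2
F = (SSR / (M - 1)) / s2                           # ~ F_{M-1, N-M} under the null
print(s2, F)
```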
Since \(\mathrm{SSR}\) and \(\mathrm{SSE}\) are both (scaled) \(\chi^2\)-distributed and independent, dividing one by the other, each normalized by its degrees of freedom, gives an \(\mathcal{F}\)-distributed statistic.
Therefore
\[F = \frac{\mathrm{MSR}}{s^2}\sim \mathcal{F}_{M - 1, N - M}\]Next, consider the expected squared loss of a regression function \(f\):
\[\mathbb{E}[L] = \mathbb{E}\left[\frac{1}{2}\left(f(x) - y\right)^2\right]\]To minimize this over functions we use the functional derivative (calculus of variations): for a functional \(F\) and a perturbation \(\eta\),
\[F(y(x) + \epsilon \eta(x)) = F(y(x)) + \epsilon\int \frac{\delta F}{\delta y(x)}\eta(x)\,\mathrm{d}x + \mathcal{O}(\epsilon^2)\]Applying this to the expected loss,
\[\begin{aligned} \frac{\delta\mathbb{E}\left[L\right]}{\delta f(x)}&=\frac{\delta}{\delta f(x)}\mathbb{E}\left[\frac{1}{2}\left(f(x) - y\right)^2\right]\\ &=\frac{\delta}{\delta f(x)}\iint \frac{1}{2}(f(x) - y)^2 p(x, y)\,\mathrm{d}x\,\mathrm{d}y \end{aligned}\]
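The derivation stops here; a standard continuation (my own completion, following the usual calculus-of-variations argument rather than anything stated above) differentiates under the integral and sets the functional derivative to zero:
\[\frac{\delta\mathbb{E}[L]}{\delta f(x)} = \int (f(x) - y)\, p(x, y)\,\mathrm{d}y = 0 \quad\Longrightarrow\quad f(x) = \frac{\int y\, p(x, y)\,\mathrm{d}y}{p(x)} = \mathbb{E}[y \mid x]\]i.e., the minimizer of the expected squared loss is the conditional mean.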